Abstract



A multidimensional non-parametric IRT model of test bias is presented, providing an explanation 
of how individually-biased items can combine through a test score to produce test bias. The 
claim is thus that bias, though expressed at the item level, should be studied at the test level. 
The model postulates an intended-to-be-measured target ability and nuisance determinants whose 
differing ability distributions across examinee groups cause bias. Multiple nuisance determinants
can produce item bias cancellation, resulting in little or no test bias. Detection of test bias requires
a valid subtest, whose items measure only target ability. A long-test viewpoint of bias is also
developed.

1 Introduction



The purpose of this paper is to present an Item Response Theory (IRT) based conceptualization 
of test bias for standardized ability tests. By "test bias" we mean a formalization of the intuitive
idea that a test is less valid for one group of examinees than for another group in its attempt to
assess examinee differences in a prescribed latent trait, such as mathematics ability. It will be seen
that test bias is the result of individually-biased items acting in concert through a test scoring
method, such as number correct, to produce a biased test. In a subsequent paper of ours, this new
conceptualization of test bias is used to undergird a new statistical test for psychological test bias
(Shealy and Stout, 1990). Also, a large-scale simulation study (Shealy, 1989) has been conducted of
the performance properties of this statistical procedure, in particular as compared with the Holland
and Thayer (1988) modification of the Mantel-Haenszel test.

We mention three distinct features of the conceptualization of bias presented herein. First, it 
provides a mechanism for explaining how several individually-biased items can combine through a 
test score to exhibit a coherent and major biasing influence at the test level. In particular, this 
can be true even if each individual item displays only a minor amount of item bias. For example,
"word problems" on a "mathematics test" that are too dependent on sophisticated written English
comprehension could combine to produce pervasive test bias against English-as-a-second-language
examinees. A second feature, possible because of our multidimensional modeling approach, is that
the underlying mechanism that produces bias is addressed. This mechanism lies in the distinction
made between the ability the test is intended to measure, called the target ability, and other
abilities influencing test performance that the test does not intend to measure, called nuisance
determinants. Test bias will be seen to occur because of the presence of nuisance determinants
possessed in differing amounts by different examinee groups. Through the presence of these nuisance
determinants, bias then is expressed in one or more items. A third feature, also possible because of
our multidimensional modeling approach, is that a careful distinction is made between genuine test
bias and non-bias differences in examinee group performance that are caused by examinee group 
differences in target ability distributions. It is important that the latter not be mistakenly labeled 
as test bias. 

The novelty of our approach to bias lies not so much with its recognition of the role of nuisance
determinants in the expression of test bias, but rather in the explicit multidimensional IRT modeling
of bias, which in turn promises a clear and thorough understanding of bias.

2 An Informal Description of Test Bias 

We begin with an informal definition of test bias. 

Definition 2.1. Test bias occurs if the test under consideration is measuring a quantity in addition 
to the one the test was designed to measure, a quantity that both groups do not possess equally. □ 

It is important to note that this notion of test bias grows out of the traditional non-IRT notion of 
test bias based on differential predictive validity. Papers by Stanley and Porter (1967), Temp (1971), 
and in particular Cleary (1968), exemplify this classical predictive view of bias. These studies used 
standardized tests to predict performance on a particular task; if the predictive link from test to
task was different for the two studied groups, then test bias was suspected. Cleary (1968), in a
seminal paper on test bias, studied bias in the prediction of college success of black and white
students in integrated colleges, using SAT verbal and math scores. Her intent was to determine if
the expected first year GPA (grade point average) for Whites was different from that for Blacks,
after the two groups had been matched on SAT score; hence, the linear regression of first year GPA
on SAT verbal (or math) score was separately fit for both groups and compared. If the expected
criterion (GPA) for those examinees attaining a particular test score (e.g., SAT combined score)
were different across group, the test score was considered a biased predictor of performance and test
bias was deemed to be present. The purpose of the Cleary study was predictive; the regressions of
criterion on test score therein were compared across group to see if the test score equitably predicts
the performance measured by the criterion. 

Our focus shifts hereafter to regressing test score on criterion. The purpose of the reversed 
regression is to corroborate that the prediction of a criterion by a test is equitable across group, 
thereby exposing the conceptual underpinning for IRT modeling of test bias - in particular our 
modeling of test bias. The regressions of test score on criterion are compared to answer the 
following question: are the average test scores for both groups the same after the groups have been 
matched on criterion performance? 
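
The reversed regression can be sketched numerically. The following is a minimal illustration, not the procedure of any study cited here: it fits a separate least-squares line of test score on criterion for each group and compares the fitted test scores at a common criterion value. The function names and the toy data are hypothetical and were invented purely for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def score_gap_at(criterion_value, group1, group2):
    """Expected test score of group 1 minus group 2 at a fixed criterion value."""
    a1, b1 = fit_line(*group1)
    a2, b2 = fit_line(*group2)
    return (a1 + b1 * criterion_value) - (a2 + b2 * criterion_value)

# Toy (criterion, test score) pairs, invented for illustration only.
group1 = ([2.0, 2.5, 3.0, 3.5], [400.0, 450.0, 500.0, 550.0])
group2 = ([2.0, 2.5, 3.0, 3.5], [380.0, 430.0, 480.0, 530.0])

# A nonzero gap (beyond statistical error) at matched criterion performance
# is what the reversed regression is meant to expose.
gap = score_gap_at(3.0, group1, group2)
```

In practice the comparison would have to account for sampling error in both fitted lines; the sketch shows only the point comparison.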

This shift to a corroborative point of view brings us again to the informal definition above. 
The difference across group in the regressions of test score on criterion (other than that caused 
by statistical error) is due to an undesirable causative factor other than the criterion; that is, at
least some of the test questions must be measuring something in addition to what the criterion
measures. Furthermore, this difference is due only to the undesirable factor, because the criterion 
has been held equal in the two groups. 

If in addition to the reversal of the regression variables just described, the criterion is now 
chosen as internal to the test instead of external to it, the concept of the internal assessment of
bias results. This internal criterion becomes the "yardstick" by which the test is judged biased or
not; it is a portion of the test itself. The implicit assumption is that the "yardstick" portion of the
test consists of items known to measure only what they are supposed to measure.

An example adapted from Shepard (1982) clearly illustrates this internally-assessed bias: a
verbal analogies test is used to compare reasoning abilities of German and Italian immigrants
to the United States, the two populations matched on English fluency. However, 20% of the test
items are based on words with Latin origins, whereas the remainder have linguistic structure equally
familiar to both groups. Here, the items with Latin origin words are possibly biased. A reasonable
internal criterion with which to assess this bias would be a score based on the responses to the
linguistically neutral items; for, it is assumed that these items are validly measuring what the test 
is intended to measure. 
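
As a rough illustration of using an internal criterion, the sketch below matches examinees on their number-correct score over the presumed-neutral items and then compares the two groups' mean scores on the suspect items within each matched level. It is only a descriptive sketch, not a statistical test; the function name, the item indices, and the response records are hypothetical.

```python
from collections import defaultdict

def matched_differences(records, valid_items, suspect_items):
    """For each valid-subtest score, mean suspect score of group 1 minus group 2.

    records: iterable of (group, responses) with group in {1, 2} and
    responses a list of 0/1 item scores.
    """
    cells = defaultdict(lambda: {1: [], 2: []})
    for group, responses in records:
        valid_score = sum(responses[i] for i in valid_items)
        suspect_score = sum(responses[i] for i in suspect_items)
        cells[valid_score][group].append(suspect_score)
    diffs = {}
    for score, by_group in sorted(cells.items()):
        # Only levels where both groups are represented can be compared.
        if by_group[1] and by_group[2]:
            mean1 = sum(by_group[1]) / len(by_group[1])
            mean2 = sum(by_group[2]) / len(by_group[2])
            diffs[score] = mean1 - mean2
    return diffs

# Hypothetical data: items 0-3 are the neutral items, item 4 is the suspect item.
records = [
    (1, [1, 1, 0, 0, 1]),
    (2, [1, 1, 0, 0, 0]),
    (1, [1, 1, 1, 0, 1]),
    (2, [1, 1, 1, 0, 1]),
    (1, [1, 0, 0, 0, 0]),
]
diffs = matched_differences(records, valid_items=[0, 1, 2, 3], suspect_items=[4])
```

Differences that persist across matched levels point toward something other than the matched ability, provided the neutral items really do measure only what the test intends.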

The internal assessment viewpoint of test bias can be clarified by noting two distinctions between 
it and the classical test theory based differential regression conceptualization of Cleary and others: 

(A) The "yardstick" (criterion), which was a measurement of task performance (e.g., 1st year
GPA in the Cleary study), is now a score internal to the test (e.g., score on the linguistically
neutral items in the Shepard example). This internal criterion is most often an aggregate
measure of a portion of the test item responses (typically number right).

(B) The differential regression approach used regressions of the external criterion on test score in
a predictive context. In internally-assessed bias studies, the responses of one or more items
suspected of bias are regressed on the internal criterion as a corroborative statistical test that
these "suspect" items are measuring the same thing that the internal criterion is measuring.

This brings us to an essential question: what is the internal criterion measuring? It is
measuring a theoretically postulated psychometric construct that is intended to be generalizable to
a variety of possible future tasks; i.e., a latent ability of an IRT model. Thus IRT modelling of
bias becomes appropriate. An example will illustrate: the SAT math test is designed to measure a
construct, "mathematical ability", which is intended to predict an examinee's future success in a
set of quantitatively-oriented college courses that require a component of such ability. Rather than
assessing the SAT test against the corresponding set of criterion measures of performance in these
courses, we wish to assess the test against the construct itself; to do so we turn to the test itself to
verify that the proper measurement of mathematical ability is being done. The internal criterion
measures this ability construct, and internal test bias is defined with respect to this construct.

The generalizability of performance measurements on a variety of tasks to a single construct, as 
described above, provides one motivation to shift to internally assessed bias studies. An additional 
motivation is the practice in recent years of creating item pools, large numbers of items that are
to be used in forming multiple versions of a standardized test (see, for example, Hambleton and
Swaminathan, 1985, Ch. 12). A newly constructed set of items intended for inclusion in the item 
pool can be tested for bias, relative to the ability construct that the pool is supposedly measuring, 
by employing internal bias detection techniques. 

Internally assessed bias studies with a variety of test populations have been done: Cotter and 
Berk (1981) attempted to detect bias in the WISC-R test with white and minority children. Dorans
and Kulick (1983), in a series of studies done at Educational Testing Service, studied the possible
effect of differential mastery of written English between native born Americans and
English-as-a-second-language Oriental students on scores of selected items on a mathematics "word problem"
test. 

Item bias studies such as the ones above usually focus on a single item at a time; if several items
in these studies are simultaneously found to be biased, it is a result of statistical bias procedures
conducted for each item separately, which raises delicate questions about simultaneous statistical
inference. Moreover, in a modeling sense, no causative reasons for the observed simultaneous bias 
are explored by item bias studies. This paper studies a form of test bias relative to an internal 
criterion; this kind of test bias considers the set of test items acting as a unit (via a common causal 
mechanism) and combining through a test scoring method. The precise formulation of test bias 
and a contrast of it to item bias is presented in Section 4. 

We now consider the question of test bias relative to an internal criterion more carefully.
Consider a situation where a single verbal analogy item is embedded in two different tests, tests M and
V say. Test V is composed of verbal analogy items, as intended, and Test M consists of mathematics
calculation items, as intended, except for the single embedded item. Assume that each item in Test
V does not contain any culture-dependent material that may favor one group. The embedded verbal
item is not biased in Test V, but the potential for bias of this item is large in Test M because the
item measures something other than the intended-to-be-measured mathematical calculation skill.
This illustrates a key component of test bias, aptly stated by Mellenbergh (1983, p. 294): "An
item can be biased in one set of items, whereas it is unbiased in another set." Shepard (1982) also
points out this relativity feature: "... if a test of spatial reasoning inadvertently included several
vocabulary items, they would be biased indicators of the [ability being measured]" and "... in a
test composed equally of two types of items reflecting ... two different [ability] constructs, it will
be a dead heat to decide statistically which set defines the test [ability] and which set becomes a 
biased measure of it." 

Implicit in the above discussion is the assumption that a portion of the test defines the internal 
criterion by which the remainder is measured for the presence of bias. A collection of items defining 
the internal criterion will be called a valid subtest. An informal definition of a valid subtest can
now be given: A subtest is valid with respect to a specified "target" ability if the subtest score
is judged to be measuring only the intended target test ability, i.e., it stands as a "proxy" of the
ability one intends to measure. More precisely, if all of the items of the subtest measure only the 
intended ability then the subtest is said to be valid. 

There is a point about this definition that needs mentioning. Primarily, the existence and 
identification of a valid subtest is an empirical decision based on expert opinion or data at least 
in part external to the data set in question. Subtest validity cannot be established based on the 
test data set alone nor can it be theoretically deduced. The "burden of proof" is an empirical one 
and lies with the test constructor. If all the items of a test depend on a second determinant (for
example, if the responses to all items depend on familiarity with standardized tests) then a valid 
subtest will not exist. Note that this is true even if the two groups are not differentially penalized 
by this dependence of test items on familiarity with standardized tests. Thus, the actual presence 
of test bias is logically independent of the existence of a valid subtest to be used for the assessment 
of test bias. 

In our framework, it must be assumed that there is a valid subtest if we are to internally detect
test bias; otherwise, it is intrinsically nondetectable internally. The responses to the valid subtest
are used to tackle the central problem in the identification of test bias: the need to distinguish
between group differences attributable to the ability construct intended to be measured and those
due to unwanted ability determinants. Because the valid subtest is assumed to measure only the
desired ability, no measures external to the test are required to assess that ability, although
to improve accuracy it may be beneficial to also use external data, especially if the valid subtest is 
short or if the assumption of its validity is at all suspect. Matching examinees using a valid subtest 
score controls for group differences in the intended-to-be-measured ability and isolates differences 
due to the unwanted determinants. A more rigorous formulation of "valid subtest" is set out in 
Section 4. 

In these discussions of test bias relative to an internal criterion, multidimensionality has
implicitly been invoked; it is impossible to discuss test bias without invoking it. The informal definition
of test bias stated above employs multidimensionality: there is mention of the quantity the "test
was designed to measure" and one "in addition to" this quantity. Lord (1980, p. 220) recognized
this in his discussion of item bias: "if many of the items [in a test] are found to be seriously biased,
it appears that the items are not strictly unidimensional".

Bias in one or more items has typically been attributed to special knowledge, unintended to 
be measured, that is more accessible to one of the test-taking groups. Ironson, Homan, Willis 
and Signer (1984) performed a bias study that involved planting within a mathematics test
mathematics word problems that required an extremely high reading level to solve them. They state
their conclusion that "... bias is sometimes thought of as a kind of multidimensionality involving
measurement of a primary dimension and a second confounding dimension". Our viewpoint here 
is that bias is always the result of multidimensionality. 

The "primary dimension" is referred to in this paper as target ability, because this is the ability 
the test intends to measure. The "confounding dimension" is referred to as a nuisance determinant. 
In the Shepard verbal analogies example above, the target ability is reasoning ability, which 80% of 
the items solely measure, while the nuisance determinant is familiarity with Latin linguistic roots. 

The full formulation of test bias is set out in Section 4; it involves certain subtleties not
discussed here. The group differences in ability level of a latent nuisance determinant provide a
common causative mechanism for bias in any collection of items on a test contaminated with such
a determinant. This is the single most important conceptual difference between the test bias model
developed in this paper and previous item bias work: the existence of a postulated common latent 
cause for the manifestation of bias across a group of test items. 

3 The IRT Model for Test Responses 

Herein we present the nonparametric multidimensional IRT model underlying our modeling of test
bias. Consider a group $G$ of examinees; the sample of examinees to take a test is considered to
be drawn at random from this population. A test is simply a collection of items; a test response of
length $N$ is the corresponding set of responses, for a randomly-chosen examinee from $G$, denoted
by

$$\underline{U} = (U_1, \ldots, U_N), \tag{3-1}$$

where the $U_i$ are random variables taking on

$$U_i = \begin{cases} 0 & \text{if the response to item } i \text{ is incorrect;} \\ 1 & \text{if the response is correct.} \end{cases}$$

The IRT model is composed of two components that generate $\underline{U}$: (1) a $d$-dimensional
examinee ability parameter and (2) a set of item response functions (IRFs), one for each item, which
determine the probability of correct response for the items. The IRT model is usually conceived as
a unidimensional ($d = 1$) model; here, a multidimensional ($d > 1$) model will be presumed.

Let us now further set notation. The ability vector is

$$\underline{\theta} = (\theta_1, \ldots, \theta_d) \tag{3-2}$$

for an arbitrary examinee from $G$. A distribution of $\underline{\theta}$ over $G$ is induced by choosing examinees at
random from $G$; the multivariate random variable is designated

$$\underline{\Theta} = (\Theta_1, \ldots, \Theta_d). \tag{3-3}$$

Examinee independence is assumed; i.e., $J$ examinees from $G$ have ability parameters

$$\{\underline{\Theta}^{(j)} : j = 1, \ldots, J\}$$

independent and identically distributed (iid) in $j$. Item $i$'s IRF, which is interpreted as the
probability that an examinee with ability $\underline{\theta}$ will answer item $i$ correctly, is denoted

$$P_i(\underline{\theta}) = P[U_i = 1 \mid \underline{\Theta} = \underline{\theta}] = P[U_i = 1 \mid \underline{\theta}].$$

Our interpretation of $P_i(\underline{\theta})$ is the sampling one: among all examinees having ability $\underline{\theta}$, the expected
proportion of them getting item $i$ correct is $P_i(\underline{\theta})$.

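The basic philosophy of the IRT model is that a latent distribution of abilities in a group $G$
drives the manifest distribution of item responses. The fundamental identity relating the responses
to the examinee group ability variable is

$$P[\underline{U} = \underline{u}] = \int P[\underline{U} = \underline{u} \mid \underline{\Theta} = \underline{\theta}] \, dF(\underline{\theta}) \tag{3-4}$$

for all $\underline{u} = (u_1, \ldots, u_N)$ (each $u_i = 0$ or 1),

where $F(\cdot)$ is the cumulative distribution function (cdf) of $\underline{\Theta}$. There are two fundamental
assumptions on the conditional test response probability $P[\underline{U} = \underline{u} \mid \underline{\theta}] \equiv P[\underline{U} = \underline{u} \mid \underline{\Theta} = \underline{\theta}]$ usually
assumed in IRT modeling. To introduce these, recall two standard definitions about ordering in
$d$-dimensional Euclidean space: (i) Let $\underline{z}$ and $\underline{z}'$ be vectors. Then $\underline{z} < \underline{z}'$ if $z_i \le z_i'$ for $i = 1, \ldots, d$
and for at least one $i$, $z_i < z_i'$. (ii) Let $\underline{z}$ and $\underline{z}'$ be vectors. The real valued function $f(\underline{z})$ is strictly
monotone if for any $\underline{z} < \underline{z}'$, $f(\underline{z}) < f(\underline{z}')$.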
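The fundamental IRT assumptions are:

Assumption 3.1. Local independence in $\underline{\theta}$: for every $\underline{\theta}$,

$$P[\underline{U} = \underline{u} \mid \underline{\theta}] = \prod_{i=1}^{N} P[U_i = u_i \mid \underline{\theta}] \quad \text{for all } u_i = 0 \text{ or } 1;\ i = 1, \ldots, N. \tag{3-5}$$

Assumption 3.2. Strict monotonicity of IRFs: The item IRFs $\{P_i(\underline{\theta}) : i = 1, \ldots, N\}$ are strictly
monotone in $\underline{\theta}$. That is, for any $i$, $P_i(\underline{\theta}') > P_i(\underline{\theta})$ if $\underline{\theta}' > \underline{\theta}$ in the sense of (i) above.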
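
To make the two assumptions concrete, here is a minimal simulation sketch. The model of this paper is nonparametric, so the logistic form used below is purely an assumption chosen for illustration, as are the function names and the item parameters. With all discriminations positive the IRF is strictly increasing in every coordinate of the ability vector, and given the ability vector the item responses are drawn independently.

```python
import math
import random

def irf(theta, a, b):
    """Hypothetical logistic IRF P_i(theta) for a d-dimensional ability vector.

    With every discrimination a_k > 0 it is strictly increasing in each
    coordinate of theta, so it is strictly monotone in the vector sense.
    """
    z = sum(ak * tk for ak, tk in zip(a, theta)) - b
    return 1.0 / (1.0 + math.exp(-z))

def simulate_response(theta, items, rng):
    """Draw U = (U_1, ..., U_N); given theta, items are drawn independently."""
    return [1 if rng.random() < irf(theta, a, b) else 0 for a, b in items]

# Three hypothetical items on a two-dimensional ability, each given as (a, b).
items = [((1.0, 0.5), 0.0), ((0.8, 0.2), -0.5), ((1.2, 1.0), 0.5)]
rng = random.Random(0)
u = simulate_response((0.3, -0.2), items, rng)
```

Sampling the ability vector from a population distribution before calling `simulate_response` would give draws from the manifest response distribution.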

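It is convenient to combine (3-4) and Assumption 3.1 in the following manner:

$$P[\underline{U} = \underline{u}] = \int \prod_{i=1}^{N} P_i(\underline{\theta})^{u_i} \left(1 - P_i(\underline{\theta})\right)^{1 - u_i} \, dF(\underline{\theta}) \tag{3-6}$$

for all $\underline{u}$.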
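
Combining the fundamental identity with local independence expresses each response-pattern probability as an average, over the ability distribution, of a product of item-level probabilities. The sketch below evaluates that average under the assumption of a discrete ability distribution, so the integral becomes a finite sum. The logistic IRFs, the four ability points, and all names are hypothetical.

```python
import math
from itertools import product

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def pattern_probability(u, irfs, ability_points):
    """P[U = u] as a weighted sum over ability points of the
    locally independent product of item response probabilities."""
    total = 0.0
    for theta, weight in ability_points:
        conditional = 1.0
        for u_i, p in zip(u, irfs):
            p_i = p(theta)
            conditional *= p_i if u_i == 1 else 1.0 - p_i
        total += weight * conditional
    return total

# Hypothetical example: three items, two-dimensional ability on four points.
irfs = [
    lambda t: logistic(t[0]),
    lambda t: logistic(t[0] + t[1]),
    lambda t: logistic(0.5 * t[0] + t[1] - 0.5),
]
ability_points = [
    ((-1.0, -1.0), 0.25), ((-1.0, 1.0), 0.25),
    ((1.0, -1.0), 0.25), ((1.0, 1.0), 0.25),
]

# Summing over all 2^N response patterns should return total probability one.
total = sum(pattern_probability(u, irfs, ability_points)
            for u in product((0, 1), repeat=3))
```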

The notion of the dimensionality $d$ of $\underline{U}$ can be mathematically formalized, but for the purposes
of this paper it is unnecessary to do so.

Definition 3.1. Let $\underline{U}$ be a test response as in (3-1). An IRT representation of $\underline{U}$ is the structure

$$\{d, \underline{\Theta}, F(\underline{\theta}), \{P_i(\underline{\theta}) : i = 1, \ldots, N\}\} \tag{3-7}$$

where (3-4), Assumption 3.1, and Assumption 3.2 hold. □




In this paper we often want to consider a test item's operating characteristic with respect to a
specific single component of $\underline{\theta}$ (say $\theta_1$). This is accomplished by "marginalizing out" the remaining
components in the $\underline{\theta}$-vector from the item's IRF, resulting in the marginal item response function
(marginal IRF). Conceptually, this IRF is a unidimensional reduction of the original one and can
be considered as a unidimensional IRF for the unidimensional ability $\theta_1$. The following definition
is due to Stout (1989).

Definition 3.2. Let $P(\underline{\theta})$ be an IRF. The marginal IRF $T(\theta_1)$ of $P(\underline{\theta})$ with respect to $\theta_1$ is defined
by

$$T(\theta_1) = E[P(\underline{\Theta}) \mid \Theta_1 = \theta_1]. \quad \square$$

The marginal IRF is essential in the discussion of modeling test bias in Section 4, where a single
component of $\underline{\theta}$ designated as the target ability will be considered. Because target ability is the
ability the test designer desires to measure using the items, the marginal IRF with respect to this
ability is a useful concept.
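
The marginal IRF can be computed directly when the joint ability distribution is discrete. The following sketch assumes a two-dimensional ability, a hypothetical logistic IRF, and a hypothetical four-point joint distribution in which a larger first component shifts probability mass toward a larger second component; none of these choices comes from the paper.

```python
import math

def irf(theta1, theta2):
    """Hypothetical two-dimensional logistic IRF P(theta1, theta2)."""
    return 1.0 / (1.0 + math.exp(-(theta1 + theta2)))

def marginal_irf(theta1, joint):
    """T(theta1) = E[P(Theta) | Theta_1 = theta1] for a discrete joint law."""
    num = 0.0
    den = 0.0
    for (t1, t2), weight in joint.items():
        if t1 == theta1:
            num += weight * irf(t1, t2)
            den += weight
    return num / den

# Hypothetical joint distribution of (Theta_1, Theta_2).
joint = {
    (0.0, -1.0): 0.3, (0.0, 1.0): 0.2,
    (1.0, -1.0): 0.2, (1.0, 1.0): 0.3,
}
t_low = marginal_irf(0.0, joint)
t_high = marginal_irf(1.0, joint)
```

In this toy case the marginal IRF increases from the lower to the higher value of the first component, which is the behaviour one would want from a unidimensional IRF.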

In order for $T(\theta_1)$ to be an IRF it must be strictly monotone; this does not follow for the
marginal IRFs of a test from the assumptions of our IRT representation (3-7). However, very mild
regularity conditions suffice to produce strict monotonicity, as has been shown by Stout (1989). To
this end, we need the concept of stochastic ordering.

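Definition 3.3. Let $\underline{Z}$ be a random vector with distribution indexed by a parameter $\gamma$. $\underline{Z}$ is
strictly stochastically increasing in $\gamma$ if for every $\underline{z}$ in the range of $\underline{Z}$

$$P[\underline{Z} > \underline{z}; \gamma] < P[\underline{Z} > \underline{z}; \gamma'] \quad \text{if } \gamma < \gamma'.$$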
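
A simple scalar instance of strict stochastic increase is a location family. The sketch below assumes, purely for illustration, that $Z$ is a scalar with a Normal($\gamma$, 1) distribution and checks the defining inequality on a finite grid of values; a grid check is of course not a proof, and the function names are hypothetical.

```python
from statistics import NormalDist

def survival(z, gamma):
    """P[Z > z; gamma] for the hypothetical family Z ~ Normal(gamma, 1)."""
    return 1.0 - NormalDist(mu=gamma, sigma=1.0).cdf(z)

def strictly_increasing_on_grid(gamma_low, gamma_high, grid):
    """Check P[Z > z; gamma_low] < P[Z > z; gamma_high] at every grid point."""
    return all(survival(z, gamma_low) < survival(z, gamma_high) for z in grid)

grid = [k / 2.0 for k in range(-6, 7)]   # z from -3 to 3 in steps of 0.5
ordered = strictly_increasing_on_grid(0.0, 0.5, grid)
```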

Strict monotonicity of the marginal IRF with respect to $\theta_1$ follows under the reasonable
assumption of stochastic order in $\theta_1$:

Theorem 3.1. (See Stout, 1989.) If $\underline{\Theta} \mid \Theta_1 = \theta_1$ is strictly stochastically increasing in $\theta_1$ in the
sense of Definition 3.3 and the IRF $P(\underline{\theta})$ is strictly monotone in $(\theta_2, \ldots, \theta_d)$, then the marginal IRF
of $P(\underline{\theta})$ with respect to $\theta_1$ is strictly monotone.

Remark. Note in Theorem 3.1 that $P(\underline{\theta})$ is not assumed to be strictly monotone in the first
component of $\underline{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)$.

A note on IRT model assumptions should be emphasized here. IRT models are commonly
parameterized; that is, the IRFs and ability distribution are members of parametric families. Typical
assumptions are that $\Theta$ is unidimensional with a standard normal distribution and that a two or
three parameter normal ogive model or a one, two, or three parameter logistic model is assumed
for the IRFs. In this paper, we assume only that the IRFs $\{P_i(\underline{\theta})\}$ are continuous, with $\underline{\theta}$ usually
multidimensional.

4 Test Bias in the IRT Model 

In this section our multidimensional IRT based notion of test bias using the IRT model of Section 3
is developed. Section 4.1 provides a brief presentation on IRT item bias as currently usually defined 
in the psychometric literature. Section 4.2 sets up the multidimensional IRT framework for test 
bias modeling; target ability and nuisance determinants are defined. Section 4.3 develops test bias 
in terms of its components: potential for bias, expressed bias, and the combining of expressed 
item biases through a test scoring method. Section 4.4 considers item bias cancellation when the 
nuisance determinants are multidimensional. Finally, Section 4.5 formally considers the notion of 
a valid subtest. 

4.1 Existing IRT Item Bias Definition 

In this section the concept of IRT-modeled item bias (in some contexts called DIF, for differential 
item functioning) currently in widespread use is presented as a backdrop for the development of 
multiple-item test bias, which is treated in Sections 4.2 and 4.3. An item is biased, according 
to Hambleton and Swaminathan (1985, p. 285), if its (necessarily unidimensional) item response
functions across groups are not identical. A formal definition is given below. 

Definition 4.1. Item bias. Let two groups of examinees be indexed by $g = 1, 2$. For each $g$, denote

$$\underline{U}_g = (U_{1g}, \ldots, U_{Ng}) \tag{4-1}$$

to be the test response from an $N$-item test for a randomly chosen examinee from Group $g$. Assume
that a unidimensional IRT model fits each group, with IRT representation for $\{\underline{U}_g : g = 1, 2\}$ (recall
Definition 3.1):

$$\{d = 1, \Theta_g, F_g(\theta), \{P_j(\theta) : j = 1, \ldots, i - 1, i + 1, \ldots, N;\ P_{ig}(\theta)\} : g = 1, 2\}, \tag{4-2}$$

where $F_g(\theta)$ denotes the cdf of $\Theta_g$. (Note, as the subscript notation indicates, that all items except
the $i$th item have group invariant IRFs while item $i$ has an IRF that possibly differs for the two
groups.)

(i) Item bias occurs in item $i$ at $\theta$ if the group specific probabilities of correct response at $\theta$ are
different; i.e., the group IRFs are different at $\theta$:

$$P_{i1}(\theta) = P[U_{i1} = 1 \mid \Theta_1 = \theta] \ne P[U_{i2} = 1 \mid \Theta_2 = \theta] = P_{i2}(\theta).$$

(ii) Item bias occurs in item $i$ if there exists some value $\theta$ for which item bias occurs at $\theta$. □

It is important to observe that the "bias" of item $i$ is defined relative to the other $N - 1$ items,
which are assumed invariant and hence "unbiased" with respect to the two groups.

Item bias models have traditionally been parametric. Wright, Mead and Draba (1976) and
Holland and Thayer (1988) consider a biased item generated by Rasch IRFs with the IRF difficulties
($b$'s) different for the 2 groups. The more general 2PL and 3PL models, with different
discriminations ($a$'s) and guessing parameters ($c$'s) across group, have been studied by Hulin, Drasgow
and Komocar (1982), Linn, Levine, Hastings, and Wardrop (1981), and Thissen, Steinberg, and
Wainer (1988), among many others.
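
The Rasch case gives a compact illustration of the item bias definition. The sketch below assumes the standard Rasch form with a group-specific difficulty and checks, over a finite grid of ability values, whether the two group IRFs differ anywhere. The difficulty values, the grid, and the function names are hypothetical, and the sketch is not the procedure of any of the cited papers.

```python
import math

def rasch_irf(theta, b):
    """Rasch IRF with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def bias_at(theta, b_group1, b_group2):
    """Group 1 IRF minus group 2 IRF at theta; nonzero means bias at theta."""
    return rasch_irf(theta, b_group1) - rasch_irf(theta, b_group2)

def item_is_biased(b_group1, b_group2, grid, tol=1e-12):
    """True if the group IRFs differ at some ability value on the grid."""
    return any(abs(bias_at(t, b_group1, b_group2)) > tol for t in grid)

grid = [k / 4.0 for k in range(-12, 13)]      # theta from -3 to 3
biased = item_is_biased(0.0, 0.5, grid)       # item is harder for group 2
unbiased = item_is_biased(0.0, 0.0, grid)     # identical difficulties
```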

Item bias addresses differential performance across group for a single item at a time. If several
items display bias relative to the remaining assumed group invariant items according to
Definition 4.1 (modified to allow several IRFs to possibly differ across group), there are no components in
Definition 4.1 that provide the facility to explain simultaneous item biasing due to a single
underlying reason. This provides the motivation for an IRT framework that explains such pervasiveness
of item bias.

4.2 The IRT Framework for Multidimensional Test Bias 

In our treatment, test bias is modeled using the nonparametric multidimensional IRT framework 
described in Section 3. The multidimensionality of the underlying latent abilities for the two groups
provides the environment from which bias expresses itself in one or more items. A crucial component
in this test bias model is the modeling of a pervasive nuisance determinant, which contaminates a
significant portion of the test items. This modeling viewpoint is an attempt to retain the view that
bias originates at the test question level yet to allow for the possibility of bias expressed through a
test score as in the classical differential regression approach discussed above in Section 2.

The setup of the multidimensional IRT model for a test administration to two groups is as
follows. The IRT representation (3-7) is assumed to hold for the combined two-group population
of examinees. This representation induces a separate IRT representation of the form of (3-7) for
each of the two groups:

$$\{d, \underline{\Theta}_g, F_g(\underline{\theta}), \{P_{ig}(\underline{\theta}) : i = 1, \ldots, N\}\}, \quad g = 1, 2, \tag{4-3}$$

where $\underline{\Theta}_g$ here denotes $\underline{\Theta}$ restricted to Group $g$, $F_g(\underline{\theta})$ denotes the cdf of $\underline{\Theta}_g$, and $P_{ig}(\underline{\theta})$ denotes the
$i$th IRF for a randomly selected examinee from the subpopulation of Group $g$ examinees of ability
$\underline{\theta}$. Note that the distribution of $\underline{\Theta}_1$ will in general be different from that of $\underline{\Theta}_2$. It is convenient to
denote the combined two group IRT representation by

$$\{d, \underline{\Theta}_g, F_g(\underline{\theta}), \{P_{ig}(\underline{\theta}) : i = 1, \ldots, N\} : g = 1, 2\}. \tag{4-4}$$

The IRT representation (4-4) will be assumed throughout the remainder of Section 4 (with (3-4),
Assumption 3.1, and Assumption 3.2 assumed to hold within each group of course). Implicit in
(4-4) is the assumption that the test measures the same psychometrically-defined ability construct
$\underline{\theta}$ in both groups.

Two basic assumptions additional to Assumptions 3.1 and 3.2 about the IRT representation (4-4)
are necessary: (1) common multidimensional IRFs for each of the two groups in the representation
(4-4) (i.e., IRF invariance across group) and (2) the capability of the test to measure (possibly with
contamination) the intended-to-be-measured ability (target ability):

Assumption 4.1. In the assumed IRT representation (4-4) assume IRF group invariance, that is,

$$P_{i1}(\underline{\theta}) = P_{i2}(\underline{\theta}) \equiv P_i(\underline{\theta}) \tag{4-5}$$

for all $\underline{\theta}$. □

This first additional assumption states that the usual IRT item parameter invariance assumed
in unidimensional IRT modeling is assumed to hold for our multidimensional IRT model, where $\underline{\theta}$
includes all the abilities influencing test performance (hence the assumption of IRF group invariance
is appropriate in this context). Such invariance does not necessarily hold for any subset of the
components of $\underline{\theta}$, in particular not for the target ability alone. Indeed, if invariance with respect to
target ability held for all items, it is intuitively clear there could be no bias. For example, in the
usual definition of item bias (Definition 4.1) invariance is not assumed for the biased item. (4-5) is
assumed throughout the rest of the paper.
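
The point that invariance in the full ability vector need not carry over to the target ability alone can be seen in a small numerical sketch. It assumes one target component and one nuisance component, a single hypothetical logistic IRF shared by both groups, and two-point nuisance distributions that differ by group and are taken to be independent of the target ability within each group. All of these choices are illustrative assumptions.

```python
import math

def irf(theta, eta):
    """Hypothetical group-invariant IRF, increasing in target and nuisance."""
    return 1.0 / (1.0 + math.exp(-(theta + eta)))

def marginal_irf(theta, eta_distribution):
    """Average of the IRF over one group's nuisance distribution.

    eta_distribution is a list of (eta value, probability) pairs; eta is
    assumed independent of theta within the group.
    """
    return sum(weight * irf(theta, eta) for eta, weight in eta_distribution)

# Group 1 tends to have more of the nuisance ability than group 2.
eta_group1 = [(0.0, 0.5), (1.0, 0.5)]
eta_group2 = [(-1.0, 0.5), (0.0, 0.5)]

# Same IRF, same target ability, different marginal probabilities of success.
gap = marginal_irf(0.0, eta_group1) - marginal_irf(0.0, eta_group2)
```

Although both groups share one IRF in the full ability space, the probability of a correct response at a fixed target ability differs by group, which is exactly the pattern an item bias analysis on the target scale would flag.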
We now define target ability. 

Definition 4.2. Target ability is the unidimensional latent ability the test intends to measure.
The target ability component is denoted by $\theta$, and the target ability random variable for Group $g$
is denoted $\Theta_g$.

Remark. $\Theta_g$ is not to be confused with $\underline{\Theta}_g$ as defined in (4-4).

If a discussion of test bias is appropriate in a test administration, then it must be the case that
the test is designed so that it is in fact measuring $\theta$, as well as possibly some nuisance components
inadvertently. We thus informally make the second additional assumption that all items of the
test in fact do measure target ability $\theta$ and possibly nuisance components $\underline{\eta}$ as well. That is, all
IRFs $P_i(\theta, \underline{\eta})$ are assumed strictly increasing in $\theta$ throughout the paper. In Shealy (1989), this
assumption is formalized and it is then proved that the existence of a representation (4-4) in turn
implies the existence of an analogous representation in terms of $(\theta, \underline{\eta})$; that is, in terms of target
ability and nuisance components. Here we bypass presentation of this formalism and instead assume
an IRT representation of the form (4-4) with

$$\underline{\Theta}_g = (\Theta_g, \underline{\eta}_g) \tag{4-6}$$

where $\Theta_g$ denotes target ability and $\underline{\eta}_g$ denotes nuisance ability for a randomly chosen Group $g$
examinee. That is, the two group IRT representation

$$\{(\Theta_g, \underline{\eta}_g), F_g(\theta, \underline{\eta}), \{P_i(\theta, \underline{\eta}) : i = 1, \ldots, N\} : g = 1, 2\} \tag{4-7}$$

where the $P_i$'s are the group invariant IRFs guaranteed to exist by Assumption 4.1, is assumed
throughout the remainder of the paper.

4.3 A Multidimensional Formulation of Test Bias 

Item bias postulates that examinees scaled on a univariate latent $\theta$ (as in Definition 4.1) display
differing item response probability across group for the biased item. We will take the postulated
ability $\theta$ to be the target ability to create an IRT-based definition of test bias.

As in item bias studies, test bias of this sort is an entity studied at the "micro level" of each
fixed value of $\theta$, so one may speak of "test bias at $\theta$." Test bias at the "macro level" may be defined
to exist if it exists at one or more single $\theta$-values; important aspects of this micro/macro duality are
considered in Section 6. The following formulation of test bias is composed of three components:

(a) The potential for bias, if it exists, resides within the multidimensional target/nuisance ability
distributions in two groups;

(b) potential for bias is expressed in items whose responses depend on one or more nuisance
determinants; and

(c) the scoring method of the test, to be viewed as an estimate of target ability, transmits expressed
item biases into test bias.

4.3.1 Potential for test bias

Before the concept of "potential for test bias" can be developed, it is necessary to introduce con- 
ditions postulating stochastic ordering of ability distributions. 

Consider a nuisance ability $\eta_g$, assumed unidimensional for simplicity of explication, for two
groups g, g = 1, 2. Either the distribution is the same for both groups or, by definition, there exists
some $\eta$ for which

$P[\eta_1 > \eta] \neq P[\eta_2 > \eta]$.

Say that, as psychometricians, we believe that Group 1 has "more" of this ability. Likely the most
natural way to mathematize this belief is to assume stochastic ordering, that is to assume

$P[\eta_1 > \eta] > P[\eta_2 > \eta]$

for all $\eta$. For $\eta_1$ and $\eta_2$ that possess densities, the graphical intuition is given in Figure 1. For
example, as Figure 1 suggests, the densities might be identical except for translation. Of course,
if two groups differ in ability distribution, it does not follow logically that one or the other group
has "more" ability. For example, a situation where the variances of $\eta_1$ and $\eta_2$ are not equal can
produce

$P[\eta_1 > \eta] < P[\eta_2 > \eta]$ for $\eta > 0$

and

$P[\eta_1 < \eta] < P[\eta_2 < \eta]$ for $\eta < 0$.

In particular, $\eta_1$ and $\eta_2$ might be symmetrically distributed about 0 with $\eta_2$ having the larger
variance, as displayed in Figure 2. Nonetheless, for many psychometric applications, it seems
plausible to assume stochastic ordering whenever ability distributions are not equal, as we will do
below.
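
The contrast between a translated pair of distributions and a pair differing only in variance can be checked numerically. The following is a minimal Python sketch, assuming normal nuisance distributions; the function names, the grid, and the particular means and standard deviations are illustrative choices and do not come from the paper. A weak inequality is used in the check so that ties on a finite grid do not matter.

```python
from statistics import NormalDist

def survival(dist, x):
    """P[eta > x] for a nuisance ability with distribution `dist`."""
    return 1.0 - dist.cdf(x)

def dominates(d1, d2, grid):
    """True if P[eta_1 > x] >= P[eta_2 > x] at every grid point,
    i.e. d1 is stochastically at least as large as d2 on the grid."""
    return all(survival(d1, x) >= survival(d2, x) for x in grid)

grid = [k / 10 for k in range(-40, 41)]  # hypothetical evaluation grid

# Translation only: Group 1 is stochastically larger than Group 2.
shift_1, shift_2 = NormalDist(0.5, 1.0), NormalDist(0.0, 1.0)
shift_ordered = dominates(shift_1, shift_2, grid)

# Equal means, unequal variances: the tail probabilities cross at 0,
# so neither group dominates the other.
var_1, var_2 = NormalDist(0.0, 1.0), NormalDist(0.0, 2.0)
variance_ordered = dominates(var_1, var_2, grid) or dominates(var_2, var_1, grid)
```

Under these assumed distributions the translated pair is ordered while the unequal-variance pair is not, which is the situation the stochastic ordering assumption rules out.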

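The potential for test bias is modeled via one or more determinants that simultaneously cause
bias in a collection of items. In particular, this cause is rooted in the conditional distributions of
$\underline{\eta}_g \mid \Theta_g = \theta$ (note that $\underline{\eta}_g$ can be multidimensional here). For a fixed $\theta$, we assume stochastic
ordering for the distributions of $\underline{\eta}_g \mid \Theta_g = \theta$ (g = 1, 2) when they are not equal:

Assumption 4.2. Let $(\Theta_g, \underline{\eta}_g)$ be as in (4-7) and fix a target ability value $\theta$. If the conditional
distributions $\underline{\eta}_1 \mid \Theta_1 = \theta$ and $\underline{\eta}_2 \mid \Theta_2 = \theta$ are different, then the assumption is that either

$(\underline{\eta}_1 \mid \Theta_1 = \theta) < (\underline{\eta}_2 \mid \Theta_2 = \theta)$ or $(\underline{\eta}_1 \mid \Theta_1 = \theta) > (\underline{\eta}_2 \mid \Theta_2 = \theta)$ (4-8)

stochastically; i.e., either

$P[\underline{\eta}_1 > \underline{\eta} \mid \Theta_1 = \theta] < P[\underline{\eta}_2 > \underline{\eta} \mid \Theta_2 = \theta]$ (4-9)

for all $\underline{\eta}$, or

$P[\underline{\eta}_1 > \underline{\eta} \mid \Theta_1 = \theta] > P[\underline{\eta}_2 > \underline{\eta} \mid \Theta_2 = \theta]$ (4-10)

for all $\underline{\eta}$. □

For example, let $\theta$ be mathematical ability and $\underline{\eta} = \eta$ be verbal ability. Then (4-9) says among all
examinees of mathematical ability $\theta$ that, stochastically, Group 2 examinees are verbally superior
to Group 1 examinees.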

With the above preparation, potential for test bias can be defined. 

Definition 4.3. Let two groups have ability distributions $(\Theta_1, \underline{\eta}_1)$ and $(\Theta_2, \underline{\eta}_2)$. Potential for test
bias exists with respect to nuisance determinant $\underline{\eta}$ at target ability level $\theta$ if either (4-9) or (4-10)
holds. If (4-9) holds, a potential disadvantage exists against Group 1 at target ability $\theta$. □

Definition 4.3 implies that a potential disadvantage can exist only if there is a nuisance deter- 
minant as a component of the latent ability vector. 



4.3.2 Expression of test bias potential 

In order for test bias to occur, its potential must be expressed in one or more items. The concept of
expressed bias, detailed in Definition 4.5 below, is similar to the item bias concept of Definition 4.1.
It is stated in terms of the marginal IRFs with respect to target ability:

Definition 4.4. Refer to (4-7). The marginal IRF

$T_{ig}(\theta) = E[P_i(\Theta_g, \underline{\eta}_g) \mid \Theta_g = \theta]$

is called the target marginal IRF for item i, Group g. □

We can now define expressed bias in item i at target ability $\theta$.

Definition 4.5. Let $\{T_{ig}(\theta) : i = 1, \ldots, N\}$ be Group g's target marginal IRFs for a test with IRT
representation (4-7).

(i) Expressed bias in item i exists at target ability $\theta$ if item i's target marginal IRF for Group 1
is not equal to the corresponding target marginal IRF for Group 2 at $\theta$.

(ii) Expressed bias in item i exists if there is some value $\theta$ for which expressed bias for item i exists
at $\theta$.

Item i is biased against Group 1 at $\theta$ if $T_{i1}(\theta) < T_{i2}(\theta)$. □

Definition 4.5 (our multidimensional IRT expressed item bias definition) is equivalent to
Definition 4.1 (the usual IRT item bias definition) if

(i) the IRT models represented by (4-2) and (4-7) are both IRT representations of $\{\underline{U}_g : g = 1, 2\}$,

(ii) the ability $\theta$ of (4-2) is the target ability $\theta$ of Definition 4.2, and

(iii) the group-dependent IRF $P_{ig}(\cdot)$ from (4-2) is taken to be the target marginal IRF $T_{ig}(\cdot)$ from
Definition 4.4.

Henceforth in the paper, "item bias" will refer specifically to the expressed item bias of
Definition 4.5.

The link between potential for bias and expressed bias for an item is the heart of test bias. The
following theorem is fundamental in establishing this link.

Theorem 4.1. Assume IRT representation (4-7) and fix the number $\theta$. If $P_i(\theta, \underline{\eta})$ is strictly
increasing in $\underline{\eta}$ and a potential disadvantage exists against Group 1 at $\theta$, then item i is biased
against Group 1 in the sense of Definition 4.5.



Proof. The result is an immediate corollary of Theorem 3.1. 

Remark. In a sense, Theorem 4.1 formalizes the obvious; dependence of an item on nuisance 
determinants with respect to which one group is disadvantaged causes expressed item bias. 
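
The mechanism in Theorem 4.1 can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the logistic form of the IRF, the loading `a_eta`, the normal conditional nuisance distributions (taken not to vary with the target ability), and all function names are hypothetical and are not specified in the paper. It approximates the target marginal IRF of Definition 4.4 by quadrature and compares the two groups.

```python
import math
from statistics import NormalDist

def irf(theta, eta, a_eta=0.8, b=0.0):
    """Hypothetical logistic IRF P_i(theta, eta), strictly increasing in both arguments."""
    return 1.0 / (1.0 + math.exp(-(theta + a_eta * eta - b)))

def target_marginal_irf(theta, eta_dist, n=2000, width=8.0):
    """Approximate T_ig(theta) = E[P_i(theta, eta_g) | Theta_g = theta]
    by midpoint quadrature over the conditional nuisance distribution."""
    lo = eta_dist.mean - width * eta_dist.stdev
    step = 2.0 * width * eta_dist.stdev / n
    total = 0.0
    for k in range(n):
        eta = lo + (k + 0.5) * step
        total += irf(theta, eta) * eta_dist.pdf(eta) * step
    return total

# Assumed conditional nuisance distributions satisfying (4-9):
# Group 2 is stochastically larger than Group 1 at every theta.
eta_group_1 = NormalDist(-0.5, 1.0)
eta_group_2 = NormalDist(0.5, 1.0)

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
t_1 = [target_marginal_irf(t, eta_group_1) for t in thetas]
t_2 = [target_marginal_irf(t, eta_group_2) for t in thetas]

# Expressed bias against Group 1 at each theta means T_i1(theta) < T_i2(theta).
biased_against_group_1 = all(a < b for a, b in zip(t_1, t_2))
```

With these assumed inputs the Group 1 marginal IRF lies below the Group 2 marginal IRF at every evaluated target ability, in line with the theorem's conclusion.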



4.3.3 Transmission of expressed item biases into test bias 

Until now the discussion has focused on a single item; we shall see that a test can consist of many 
items simultaneously biased by the same nuisance determinant. In this case, items can cohere and 
act through the prescribed test score to produce substantial bias against a particular group even if 
individual items display undetectably small amounts of item bias. 

This is the final component of our formulation of test bias mentioned at the beginning of this
section. We consider the large class of test scores of the form

$h(\underline{U})$ (4-11)

where $h(\underline{u})$ is real valued with domain all $\underline{u} = (u_1, \ldots, u_N)$ such that $u_i = 0$ or 1 for $i = 1, \ldots, N$
and $h(\underline{u})$ is coordinatewise non-decreasing in $\underline{u}$. This class contains many of the standard scoring
procedures for many standard models; for example, number correct, linear formula scoring of the
form $\sum_{i=1}^{N} a_i U_i$ with $a_i > 0$, maximum likelihood estimation of ability for certain logistic models
with item parameters assumed known, etc. One is surely willing to restrict attention to test scores
of the form (4-11), if the test's IRFs are known to be increasing. Following Rosenbaum (1985), test
scores of the form (4-11) will be called non-decreasing item summaries.
Test bias is defined with respect to a specific test scoring method $h(\underline{U})$.

Definition 4.6. A test $\underline{U}$ with target ability $\theta$ and test score $h(\underline{U})$ displays test bias against
Group 1 at $\theta$ if

$E[h(\underline{U}_1) \mid \Theta_1 = \theta] < E[h(\underline{U}_2) \mid \Theta_2 = \theta]$. (4-12)

If

$E[h(\underline{U}_1) \mid \Theta_1 = \theta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta]$ (4-13)

then no test bias exists at $\theta$. □



The psychometric interpretation of Definition 4.6 is as follows. The left side of (4-13) is the
expected test score for a randomly chosen Group 1 examinee with target ability $\theta$ while the right
hand side is the same for a randomly chosen Group 2 examinee with target ability $\theta$. In order to
assess the appropriateness of Definition 4.6, consider a large number of Group 1 and a large number
of Group 2 examinees taking the test, all of target ability $\theta$. Then (4-13) says that the average
score of these Group 1 examinees will be approximately the same as that of the Group 2 examinees.
Thus, on average, neither group is favored in the attempt to estimate target ability using $h(\underline{U}_g)$.
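
To see how small item-level biases can accumulate through the test score, the following Python sketch evaluates both sides of (4-12) for the number correct score, for which the conditional expected score is the sum of the target marginal IRFs. Everything specific here is an assumption made for illustration: the logistic IRF, the nuisance loading, the 40 item difficulties, and the small shift between the two groups' nuisance distributions.

```python
import math
from statistics import NormalDist

def irf(theta, eta, b, a_eta=0.5):
    """Hypothetical logistic IRF with difficulty b and nuisance loading a_eta."""
    return 1.0 / (1.0 + math.exp(-(theta + a_eta * eta - b)))

def marginal_irf(theta, b, eta_dist, n=2000, width=8.0):
    """Target marginal IRF E[P_i(theta, eta_g) | Theta_g = theta] by midpoint quadrature."""
    lo = eta_dist.mean - width * eta_dist.stdev
    step = 2.0 * width * eta_dist.stdev / n
    total = 0.0
    for k in range(n):
        eta = lo + (k + 0.5) * step
        total += irf(theta, eta, b) * eta_dist.pdf(eta) * step
    return total

N_ITEMS = 40
difficulties = [-1.5 + 3.0 * i / (N_ITEMS - 1) for i in range(N_ITEMS)]

# Assumed nuisance distributions: a small shift in favor of Group 2.
eta_group_1 = NormalDist(-0.1, 1.0)
eta_group_2 = NormalDist(0.1, 1.0)

theta = 0.0
item_gaps = [
    marginal_irf(theta, b, eta_group_2) - marginal_irf(theta, b, eta_group_1)
    for b in difficulties
]
# For number correct, E[h(U_g) | theta] is the sum over items of the marginal IRFs,
# so the test-level gap is the sum of the item-level gaps.
test_gap = sum(item_gaps)
```

In this hypothetical setting each item's gap is only a few hundredths on the probability scale, yet the expected number correct scores of equally able examinees in the two groups differ by a sizeable fraction of a score point.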

4.3.4 A fundamental relationship 

We now elucidate how the three conceptual components of our formulation interact to produce test
bias. For ease of interpretation we restrict ourselves to the case of a unidimensional $\eta$; however,
the following results hold if a vector-valued nuisance determinant $\underline{\eta}$ is assumed.

The basic test bias result is given in Theorem 4.2, namely the precise mechanism by which
potential for bias is transmitted into test bias. First a variation of a well-known lemma is needed,
which for convenience is specialized to the present setting.

Lemma 4.1. Let $f(\eta)$ be strictly increasing in $\eta$ and let stochastic ordering in the sense of (4-9)
hold for each fixed $\theta$. Then for each fixed $\theta$,

$E[f(\eta_1) \mid \Theta_1 = \theta] < E[f(\eta_2) \mid \Theta_2 = \theta]$.

Proof. Fix $\theta$ and let $F_g(x)$ denote the cdf of $f(\eta_g) \mid \Theta_g = \theta$. Assume, for simplicity of argument
and without loss of generality, that $F_g(0) = 0$ for g = 1, 2. Then

$E[f(\eta_g) \mid \Theta_g = \theta] = \int_0^\infty x \, dF_g(x)$.

Integration by parts yields

$E[f(\eta_g) \mid \Theta_g = \theta] = \int_0^\infty (1 - F_g(x)) \, dx$. (4-14)

But (4-9) and $f(\eta)$ strictly increasing implies that

$F_1(x) > F_2(x)$ for all $x > 0$.

Using (4-14) and noting that

$\int_0^\infty (1 - F_1(x)) \, dx < \int_0^\infty (1 - F_2(x)) \, dx$,

the desired result follows. □
The theory of associated random variables is helpful in establishing the basic test bias result. 
As defined by Esary, Proschan, and Walkup (1967), a random vector $\underline{X}$ is associated if, and only
if, for all nondecreasing $f(\underline{x})$, $g(\underline{x})$, it follows that

$\mathrm{cov}(f(\underline{X}), g(\underline{X})) \geq 0$. (4-15)

The main result of Esary, Proschan, and Walkup (1967) that we wish to use is that a vector of
independent random variables is associated. The basic result can now be stated and proved.

Theorem 4.2. Assume IRT representation (4-7) with $\underline{\eta} = \eta$ being unidimensional. Fix the number
$\theta$ and assume the test scoring method of the form (4-11). Suppose for some i that $h(\underline{u})$ is strictly
increasing as $u_i = 0$ increases to $u_i = 1$ and that $P_i(\theta, \eta)$ is strictly increasing in $\eta$. Assume
potential for bias at $\theta$ against Group 1, i.e., that (4-9) holds. Then test bias at $\theta$ against Group 1
holds.

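Proof. It suffices to prove (4-12). By IRF invariance with respect to $(\theta, \eta)$, it follows for all $\eta$
and the fixed $\theta$ that

$E[h(\underline{U}_1) \mid \Theta_1 = \theta, \eta_1 = \eta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta, \eta_2 = \eta]$. (4-16)

Conditioning on $\Theta_g = \theta$, $\eta_g = \eta$ will be denoted by $\theta, \eta$. Let

$f(\eta) = E[h(\underline{U}_g) \mid \theta, \eta]$.

Note that $f(\eta)$ does not depend on g by (4-16), hence let $\underline{U} = \underline{U}_1$ throughout the remainder of the
proof. We first show that $f(\eta)$ is strictly increasing in $\eta$. Fix $\eta' > \eta$. Then, by local independence,
the likelihood ratio

$q(\underline{u}) = P[\underline{U} = \underline{u} \mid \theta, \eta'] \, / \, P[\underline{U} = \underline{u} \mid \theta, \eta]$

factors into a product of per-item ratios. Thus, $q(\underline{u})$ is strictly increasing as $u_i = 0$ increases to $u_i = 1$
because $P_i(\theta, \eta')/P_i(\theta, \eta) > 1$. Now

$E(h(\underline{U}) \mid \theta, \eta') - E(h(\underline{U}) \mid \theta, \eta) = \sum_{\underline{u}} h(\underline{u}) \, (P[\underline{U} = \underline{u} \mid \theta, \eta'] - P[\underline{U} = \underline{u} \mid \theta, \eta])$

$= \sum_{\underline{u}} h(\underline{u}) \, [q(\underline{u}) - 1] \, P[\underline{U} = \underline{u} \mid \theta, \eta]$. (4-17)

Partition

$\underline{U} = (\underline{U}', U_i)$

where $\underline{U}'$ denotes the responses to all items other than item i. Let $E_{\underline{U}'}$, $\mathrm{cov}_{\underline{U}'}$ and $E_{U_i}$, $\mathrm{cov}_{U_i}$ denote
expectation and covariance over the distributions of $\underline{U}'$ and $U_i$ respectively. By
a basic identity for covariance, stated here conditional on $(\theta, \eta)$,

$\mathrm{cov}(h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta) = E_{\underline{U}'}\{\mathrm{cov}_{U_i}[h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta] \mid \theta, \eta\} + \mathrm{cov}_{\underline{U}'}(E_{U_i}[h(\underline{U}) \mid \theta, \eta], E_{U_i}[q(\underline{U}) - 1 \mid \theta, \eta])$. (4-18)

Both $h(\underline{u})$ and $q(\underline{u}) - 1$ are strictly increasing as $u_i = 0$ increases to $u_i = 1$. Thus, for all possible
values of $\underline{U}'$,

$\mathrm{cov}_{U_i}[h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta] > 0$.

Thus, the first term on the right hand side of (4-18) is strictly positive. Because of the association
of independent random variables and the fact that $\underline{U}$ given $\theta, \eta$ has independent components, it
follows that the second term on the right hand side of (4-18) is nonnegative, using also the fact
that

$E_{U_i}(h(\underline{U}) \mid \theta, \eta)$ and $E_{U_i}(q(\underline{U}) - 1 \mid \theta, \eta)$

are nondecreasing in $\underline{U}'$. Thus,

$\mathrm{cov}(h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta) > 0$.

But, recalling (4-17) and noting that $E[q(\underline{U}) - 1 \mid \theta, \eta] = 0$, so that the right hand side of (4-17) equals this covariance,

$E(h(\underline{U}) \mid \theta, \eta') - E(h(\underline{U}) \mid \theta, \eta) > 0$;

that is, $f(\eta)$ is strictly increasing in $\eta$, as claimed. Then, applying Lemma 4.1 and (4-9) to $f(\eta)$
above, it follows that (4-12) holds, as required. □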
Remarks.

(i) It is important to reemphasize that Theorem 4.2 holds if a vector valued nuisance parameter
$\underline{\eta}$ is assumed, provided (4-9), the potential for bias at $\theta$, holds for $\underline{\eta}$. That is, the nuisance
determinants $\eta_1, \ldots, \eta_{d-1}$ must each create bias in the same direction, say against Group 1.

(ii) Stripped of its test bias context and stated as a general theorem about IRT models, a minor
variant of Theorem 4.2 with $<$ replaced by $\leq$ at appropriate places is due to Rosenbaum (1985).
For our purposes, strict inequality is needed. The proof of Rosenbaum's result
is similar to our proof.

A final interesting relationship to note is that the presence of test bias implies the potential for 
test bias: 

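Theorem 4.3. Suppose that test bias against Group 1 holds at $\theta$ in the sense of (4-12). Then the
potential for bias against Group 1 at $\theta$ exists in the sense that (4-9) holds.

Proof. Recall (4-16), replacing $\eta$ by $\underline{\eta}$ there. Thus for g = 1, 2, it holds that

$E[h(\underline{U}_g) \mid \Theta_g = \theta] = \int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_g(\underline{\eta} \mid \theta)$ (4-19)

where $F_g(\underline{\eta} \mid \theta)$ is the cdf of $\underline{\eta}_g \mid \Theta_g = \theta$. Suppose (4-12). Thus, using (4-19) for g = 1, 2 it follows
that

$\int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_1(\underline{\eta} \mid \theta) < \int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_2(\underline{\eta} \mid \theta)$.

But this implies that the distributions of $\underline{\eta}_1 \mid \Theta_1 = \theta$ and $\underline{\eta}_2 \mid \Theta_2 = \theta$ are different. Thus, invoking
Assumption 4.2, it follows that (4-9) holds. □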

4.4 Item Bias Cancellation 

As discussed above, and epitomized by Theorem 4.2, items can combine to amplify bias at the test 
level. In contrast, items displaying bias can also tend to cancel each other out, thus producing 
little or no bias at the test level. This becomes possible only when the nuisance determinant $\underline{\eta}$ is
multidimensional with some of its components displaying potential for bias against Group 1 and
others displaying potential for bias against Group 2. The amount of expressed test bias will be
a result of the amount of cancellation at the test level and will be dependent on the particular
test score $h(\underline{u})$ used. The theme of cancellation has been presented by Humphreys (1986) and
Roznowski (1987) in the non-IRT classical predictive validity context.

The following example illustrates how cancellation can function to produce negligible test bias. 

Example 4.1. A test of length N (N an even number for convenience) intended to measure
calculation skills has IRT representation

$\{\{d = 3, (\Theta_g, \eta_{1g}, \eta_{2g}), F_g(\theta, \eta_1, \eta_2), \{P_i(\theta, \eta_1, \eta_2) : i = 1, \ldots, N\}\}, g = 1, 2\}$

where $\theta$ = mathematics skills, $\eta_1$ = physics knowledge, and $\eta_2$ = reading knowledge. Let $S_1$ be a
subtest with IRFs

$\{P_i(\theta, \eta_1) : i = 1, \ldots, N/2\}$

(subtest containing problems with a mathematical physics flavor) strictly increasing in $\eta_1$ for every
$\theta$ and $S_2$ be a subtest with IRFs

$\{P_i(\theta, \eta_2) : i = N/2 + 1, \ldots, N\}$

(subtest containing mathematical "word problems") strictly increasing in $\eta_2$ for every $\theta$. Suppose
that the ith physics IRF is identical to the ith word problem IRF, which is the $(N/2 + i)$th item.

Now, condition on a particular mathematics ability $\theta$, and assume for examinees of ability $\theta$ that
Group 2 has greater knowledge of physics and Group 1 has greater reading skill. So $\eta_{12} \mid \theta > \eta_{11} \mid \theta$
stochastically and $\eta_{21} \mid \theta > \eta_{22} \mid \theta$ stochastically. Say that this holds for each
choice of $\theta$. Furthermore suppose that as distributions, $\eta_{12} \mid \theta = \eta_{21} \mid \theta$ and $\eta_{11} \mid \theta = \eta_{22} \mid \theta$ for
all $\theta$. Then by Theorem 4.2, if subtest $S_1$ were the entire test, it would exhibit test bias against
Group 1 at $\theta$ for every $\theta$. By contrast, if $S_2$ were the entire test, it would exhibit test bias against
Group 2 at $\theta$ for every $\theta$. But, for a large class of test scores (those giving approximately equal
weight to the $S_1$ items and to the $S_2$ items), almost total cancellation of the item biases could occur,
thus producing an unbiased test. That is, for such a test scoring method $h(\underline{u})$,

$E[h(\underline{U}_1) \mid \Theta_1 = \theta] \approx E[h(\underline{U}_2) \mid \Theta_2 = \theta]$

for every $\theta$. Indeed if $h(\underline{u})$ is number correct, then exact equality and hence total cancellation
results.
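
The cancellation in Example 4.1 can be reproduced in a small numerical sketch. The code below is an illustration under assumptions that are not part of the paper: a logistic IRF, ten hypothetical item difficulties shared by the two subtests, and normal nuisance distributions labelled `low` and `high` that are swapped between the groups as the example requires.

```python
import math
from statistics import NormalDist

def irf(theta, eta, b):
    """Hypothetical logistic IRF depending on target ability and one nuisance ability."""
    return 1.0 / (1.0 + math.exp(-(theta + eta - b)))

def marginal_irf(theta, b, eta_dist, n=2000, width=8.0):
    """Target marginal IRF by midpoint quadrature over the nuisance distribution."""
    lo = eta_dist.mean - width * eta_dist.stdev
    step = 2.0 * width * eta_dist.stdev / n
    total = 0.0
    for k in range(n):
        eta = lo + (k + 0.5) * step
        total += irf(theta, eta, b) * eta_dist.pdf(eta) * step
    return total

# The ith physics item and the ith word problem share the same IRF (same difficulty).
difficulties = [-1.0 + 0.2 * i for i in range(10)]

# Assumed conditional nuisance distributions at the fixed theta.
low, high = NormalDist(-0.5, 1.0), NormalDist(0.5, 1.0)
physics = {1: low, 2: high}   # Group 2 knows more physics
reading = {1: high, 2: low}   # Group 1 reads better

def expected_subscores(theta, group):
    """Expected number correct on S_1 (physics) and S_2 (word problems) for one group."""
    s1 = sum(marginal_irf(theta, b, physics[group]) for b in difficulties)
    s2 = sum(marginal_irf(theta, b, reading[group]) for b in difficulties)
    return s1, s2

theta = 0.0
s1_g1, s2_g1 = expected_subscores(theta, 1)
s1_g2, s2_g2 = expected_subscores(theta, 2)
total_gap = (s1_g1 + s2_g1) - (s1_g2 + s2_g2)
```

Under these assumptions each subtest alone favors one group, while the expected number correct on the full test is the same for both groups at the chosen target ability.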



Remark. Note that the concept of test bias compares groups, not individuals. For a particular
examinee, a test might be biased against her, even though the test is not biased against Group 1 of 
which she is a member. This important aspect of bias is an unfortunate consequence of the multidi- 
mensional nature of items in most tests. Moreover, it is also a consequence of the unfortunate (and 
perhaps economically unavoidable) fact that only statistical (i.e., group-level) bias analysis is done, 
as opposed to individual case-by-case analysis. The above discussed phenomenon of cancellation 
could possibly alleviate the impact at the individual examinee level (as well, as just discussed, as 
at group level). 

It is worthwhile to develop item bias cancellation in a formal manner. 

Definition 4.7. Item bias cancellation at $\theta$ is said to occur if the test consists both of items biased
against Group 1 at $\theta$ and items biased against Group 2 at $\theta$.

Remark. It is theoretically possible that cancellation could occur within an item if the item 
depends on at least two nuisance dimensions, as contrasted with the between item cancellation 
of Definition 4.7. This source of cancellation, which seems less likely to occur in practice, is not 
considered in this paper. 

Intuitively, the presence of expressed item bias and no cancellation implies test bias. This is
the content of Theorem 4.4.

Theorem 4.4. Assume that at least one item displays expressed item bias at $\theta$ in the sense of
Definition 4.5, and assume that no item bias cancellation occurs at $\theta$. Then test bias occurs at $\theta$ in
all non-decreasing item summary test scores $h(\underline{u})$ (see (4-11)) provided $h(\underline{u})$ is strictly increasing
in at least one coordinate corresponding to one of the biased items.

Proof. At the item level, each item is either biased only against one group (Group 1, say) or
displays no expressed bias by the assumption of no cancellation. Thus, for all i,

$P[U_{i1} = 1 \mid \Theta_1 = \theta] \leq P[U_{i2} = 1 \mid \Theta_2 = \theta]$ (4-20)

with strict inequality for at least one i. Now, by item invariance, for all i,

$P[U_{i1} = 1 \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] = P[U_{i2} = 1 \mid \Theta_2 = \theta, \underline{\eta}_2 = \underline{\eta}] \equiv P_i(\theta, \underline{\eta})$.

Recall Assumption 4.2. Note that, denoting the cdf of $(\underline{\eta}_g \mid \Theta_g = \theta)$ by $F_g(\underline{\eta} \mid \theta)$,

$P[U_{ig} = 1 \mid \Theta_g = \theta] = \int P_i(\theta, \underline{\eta}) \, dF_g(\underline{\eta} \mid \theta)$

where the integrand does not depend on g. It follows from Assumption 4.2 that strict inequality in
(4-20) for some i implies that $(\underline{\eta}_1 \mid \Theta_1 = \theta) < (\underline{\eta}_2 \mid \Theta_2 = \theta)$ stochastically. Thus, using the monotone
condition for $h(\underline{u})$, the conclusion follows from Theorem 4.2, noting the remark following the proof
of Theorem 4.2 concerning multiple nuisance determinants. □
It is interesting to note, as Theorem 4.5 now states, that when there is no item bias cancellation,
test bias for number correct is equivalent to test bias for all nondecreasing item summary test
scores with strict increase for at least one coordinate of $\underline{u}$.

Theorem 4.5. (a) If test bias at $\theta$ occurs for the test score number correct ($\sum_{i=1}^{N} U_i$) and there is
no item bias cancellation at $\theta$, then test bias occurs at $\theta$ for every nondecreasing item summary test
score $h(\underline{u})$ for which $h(\underline{u})$ is strictly increasing in at least one coordinate of $\underline{u}$. (b) If test bias at $\theta$
holds for some nondecreasing item summary test score $h(\underline{u})$ and there is no item bias cancellation
at $\theta$, then test bias at $\theta$ holds for $h(\underline{u}) = \sum_{i=1}^{N} u_i$.

Proof. Note that

$E[\sum_{i=1}^{N} U_{ig} \mid \Theta_g = \theta] = \sum_{i=1}^{N} \int P_i(\theta, \underline{\eta}) \, dF_g(\underline{\eta} \mid \theta)$.

Then, obvious and minor modifications in the proof of Theorem 4.4 suffice to prove both (a) and
(b). Details are omitted. □
Intuitively, no test bias and no cancellation implies that none of the items display bias. This is
the content of Theorem 4.6.

Theorem 4.6. Assume that no test bias exists at $\theta$ with respect to score $h(\underline{u})$. Assume no item
bias cancellation at $\theta$ in the sense of Definition 4.7. In addition, assume that there exists at least
one i such that both $P_i(\theta, \underline{\eta})$ is strictly increasing in $\underline{\eta}$ and $h(\underline{u})$ is strictly increasing as $u_i = 0$
increases to $u_i = 1$. Then there is no potential for test bias and (hence) none of the items display
item bias.

Proof. By assumption of no test bias at $\theta$,

$E[h(\underline{U}_1) \mid \Theta_1 = \theta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta]$. (4-21)

By the strict increasing assumption for $P_i(\theta, \underline{\eta})$ and $h(\underline{u})$, it follows that $E[h(\underline{U}_g) \mid \theta, \underline{\eta}]$ is strictly
increasing in $\underline{\eta}$. Recall (4-19). If either (4-9) or (4-10) were to hold, it would thus be impossible for
(4-21) to hold. Thus by regularity Assumption 4.2, it follows that $(\underline{\eta}_1 \mid \Theta_1 = \theta) = (\underline{\eta}_2 \mid \Theta_2 = \theta)$
stochastically; i.e., there is no potential for test bias. Referring to Theorem 4.1, we see that none
of the items display item bias. □

Remarks.

(i) Assuming a scoring method really dependent on all items and that at least one of the items
actually depends on $\underline{\eta}$, Theorem 4.6 implies that if there is potential for bias, then either test
bias results or item bias cancellation results (and possibly both result simultaneously).

(ii) Theorems 4.2 and 4.6 can be together interpreted as stating a set of conditions under which
the potential for test bias is equivalent to test bias.

4.5 Valid Subtest

Recall the informal definition of a valid subtest from Section 2. As mentioned therein, the reason 
for requiring a valid subtest to exist is that it is statistically impossible to detect test bias using 
only data from an ability test unless there exists an internal criterion measuring only the target 
ability; i.e., a valid subtest. Here we formally define the validity of a subtest. Let $\theta$ denote the
target ability. Recall from Section 4.2 that all IRFs are assumed strictly increasing in $\theta$.

Definition 4.8. Let $\underline{U}$ be a test response with IRT representation (3-7), let $\underline{\theta} = (\theta, \underline{\eta})$, and let S
be a subset of the items 1, . . . , N. S is a valid subtest if the IRFs of all items in S depend only on
$\theta$; i.e., $P_i(\theta, \underline{\eta}) = P_i(\theta)$ for each i in S.

Remarks. 

(i) From a practical viewpoint one wants S to consist of as many of the items of the test as
possible; the statistical power of detecting test bias increases as the proportion of valid items
does.

(ii) Consider a specified nondecreasing item summary scoring method $h(\underline{u})$ for a test response $\underline{U}$
(recall (4-11)). Suitably restrict this scoring method to a subtest response $\underline{U}'$, denoting it by
$h'(\underline{u}')$. For example, if $h(\underline{u}) = \sum_{i=1}^{N} u_i / N$, then $h'(\underline{u}') = \sum' u_i / N'$ is the obvious "restriction",
where $N'$ is the cardinality of $\underline{U}'$ and $\sum'$ denotes summation over the components of $\underline{U}'$. A
plausible alternative definition of subtest validity consistent with this paper's emphasis on
the expression of bias at the test level expressed through the test score would be to require
of $h'(\underline{u}')$ that, for all $\theta$,

$E[h'(\underline{U}') \mid (\Theta, \underline{\eta}) = (\theta, \underline{\eta})]$

depends only on $\theta$ and not on $\underline{\eta}$. This assertion is equivalent to asserting for all $(\theta, \underline{\eta})$ that

$E[h'(\underline{U}') \mid (\Theta, \underline{\eta}) = (\theta, \underline{\eta})] = E[h'(\underline{U}') \mid \Theta = \theta]$. (4-22)

(4-22) is appealing as a possible definition of subtest validity because it functions in an
aggregate way at the test level based on the specified test scoring method "restricted" to the
subtest. Evoking the usual empirical interpretation of expectation, (4-22) says that repeated
sampling of examinees from ability groups, both with the same value of $\theta$ but with any choice
of two different values of $\underline{\eta}$, produces on average approximately the same value of $h'(\underline{U}')$, as
one would wish a "valid subtest" to do.

Fortunately, however, this alternate and appealing definition is actually equivalent to our
Definition 4.8, under the natural and mild regularity condition that $h'(\underline{u}')$ be strictly increasing
as $u_i' = 0$ is increased to $u_i' = 1$ for each component $u_i'$ of $\underline{u}'$; that is, that $h'(\underline{u}')$ must really
depend on each of the valid subtest item responses. This assertion follows from a modification
of the proof of Theorem 4.2. Thus our definition of subtest validity can be thought of as
operating either at the item level (Definition 4.8) or at the test level ((4-22)).

(iii) Assume a two group representation (4-3). It is perhaps interesting to note it is possible for
all $\theta$ that

$E[h'(\underline{U}'_1) \mid \Theta_1 = \theta] = E[h'(\underline{U}'_2) \mid \Theta_2 = \theta]$ (4-23)

and yet subtest validity not hold. Note here that (4-22), equivalent to subtest validity,
implies (4-23); however, (4-23) should not be used as a definition of subtest validity. As an
extreme example demonstrating this claim, each item of S could be measuring $\eta_1$ alone with
$\underline{\eta}_g$ independent of $\Theta_g$ for g = 1 and g = 2 and $\underline{\eta}_g$ having the same distribution for g = 1, 2.
Subtest validity obviously does not hold here because the supposed-to-be valid items may be
heavily influenced by $\underline{\eta}$; however,

$E[h'(\underline{U}'_1) \mid \Theta_1 = \theta] = E[E\{h'(\underline{U}'_1) \mid \Theta_1 = \theta, \underline{\eta}_1\} \mid \Theta_1 = \theta]$

$= E[E\{h'(\underline{U}'_1) \mid \underline{\eta}_1\} \mid \Theta_1 = \theta]$

$= E h'(\underline{U}'_1)$ (4-24)

$= E h'(\underline{U}'_2) = \ldots$

$= E[h'(\underline{U}'_2) \mid \Theta_2 = \theta]$;

so (4-23) does hold here. The point we have just shown is that the absence of test bias (i.e.,
that (4-23) holds) does not rule out test invalidity (i.e., that (4-22) fails). Related to this fact,
note that test validity for the entire test in the sense that (4-22) holds for all $(\theta, \underline{\eta})$ for some
scoring method $h(\underline{u})$ that is increasing in every component $u_i$ of $\underline{u}$ does imply for every $\theta$ that
no test bias exists. This follows trivially from the fact that test validity for the entire test
means that every item depends only on $\theta$.

5 Test Bias: The Long Test Case 

The theory of test bias presented in Section 4 shows that if there is at least one nuisance dimension 
then test bias may be present. It is well known that purely unidimensional tests are rare among 
typical aptitude and achievement tests (see Ansley and Forsyth (1985), Humphreys (1984), Reckase, 
Carlson, Ackerman, and Spray (1986), and Yen (1984), among others). The position is summarized 
well in Humphreys (1984): 

The related problems of dimensionality and bias of items are being approached in an 
arbitrary and oversimplified fashion. It should be obvious that unidimensionality can 
only be approximated. . . . The large amount of unique variance in items is not random 
error, although it can be called error from the point of view of the attribute that one is 
attempting to measure. . . . We start with the assumption that responses to items have
many causes or determinants. 

How does the empirical reality of multiple determinants on a test interact with our multidimensional
model of test bias? There are two cases to consider: either the test is "long" or it is "short".
By "long" it is meant that the number of items is large enough that asymptotic probabilistic
arguments provide a useful approximation to the actual test operating characteristics. For example,
for many purposes a test of 40 items can be classified as "long".

In the case of a short test, several of the results in Section 4 are important: First, even if 
nuisance determinants are present in the items and influence examinee performance, the potential 
for bias against a group must exist in order for test bias to be possible. Second, if the amount of 
expressed bias at the item level is sufficiently small, then the amount of bias possible at the test 
level is bounded above. However, if little or no cancellation occurs, small amounts of bias at the 
item level can produce a substantial amount of test bias. Indeed, one can imagine a detrimental 
amount of test bias, but with statistical testing for individual item bias being unable to detect 
any bias at the item level. Third, the amount of test bias is dependent upon the scoring method, 
the scoring method being the link between item and test bias. It is possible that some scoring 
methods might be more robust against the detrimental influence of item bias than others. Fourth, 
recalling Example 4.1 and the material on item bias cancellation, it is quite possible to minimize,
with the help of an aptly chosen scoring method, the amount of test bias by having different biasing
influences cancelling each other out. For example, (again recall Example 4.1) if approximately equal
numbers of items express approximately equal amounts of bias, respectively against and in favor
of Group 1, then provided the scoring method gives approximately equal weight to the two classes
of items, little or no test bias should occur. Intuitively, it seems likely that having many minor
dimensions in addition to $\theta$ might increase the propensity for cancellation and actually result in
less test bias. However, in spite of certain encouraging aspects of the above remarks, it is surely 
the fact, because of the intrinsic multidimensional nature of ability tests, that serious amounts of 
test bias are likely when tests are short. 

We now turn the discussion to the development of a "long" test scenario. In the study of test
bias in a long test, the theory of essential unidimensionality of a test, as developed by Stout (1987,
1989) and refined by Junker (1989a, b) turns out to be useful. First we summarize the relevant
concepts of this theory.

A "long" test response $\underline{U}_N$ is conceptualized as being the initial observed segment of a potentially
observable infinite item pool $\{U_i, i \geq 1\}$. It is assumed that whatever process has been used to
construct the first N items of the pool (i.e., the observed test $\underline{U}_N$) could have been continued in the
same manner to produce $\{U_i, i \geq 1\}$. With this understanding, in order to do asymptotic statistical
theory and for foundational purposes, we study $\{U_i, i \geq 1\}$ instead of $\underline{U}_N = \{U_i, 1 \leq i \leq N\}$,
conceptualizing the item pool $\{U_i, i \geq 1\}$ as the "test". A test $\{U_i, i \geq 1\}$ is defined to be
essentially unidimensional ($d_E = 1$) if it has an IRT representation with monotone IRFs but instead
of requiring local independence (Assumption 3.1), the weaker assumption is required that

$\frac{\sum_{1 \leq i < j \leq N} |\mathrm{cov}(U_i, U_j \mid \Theta = \theta)|}{\binom{N}{2}} \to 0$ (5-1)

as $N \to \infty$ for every $\theta$. (The requirement of monotonicity can be weakened somewhat when
modeling items where non-monotonicity is suspected, but we omit discussion here; see Stout, 1989;
Junker, 1989b.) When $d_E = 1$, it is shown that the latent ability is unique in the sense that any
other $d_E = 1$ IRT representation has a latent trait that is a monotone rescaling of $\theta$. (E.g., a
mathematics test cannot be a test of geography for the reason that there exists no such rescaling.)
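
The averaged conditional covariance in the essential unidimensionality condition can be explored numerically. The sketch below rests on assumptions made only for illustration: a single nuisance ability that is standard normal given the target ability, a logistic IRF with a per-item nuisance `loading`, and local independence given both abilities, so that the conditional covariance of two items given target ability equals the covariance of their IRFs over the nuisance distribution. A fixed handful of nuisance-dependent items is embedded in tests of growing length.

```python
import math
from itertools import combinations
from statistics import NormalDist

ETA = NormalDist(0.0, 1.0)  # assumed distribution of the nuisance ability given theta

def expect(fn, n=2000, width=8.0):
    """E[fn(eta)] under ETA, by midpoint quadrature."""
    step = 2.0 * width / n
    total = 0.0
    for k in range(n):
        eta = -width + (k + 0.5) * step
        total += fn(eta) * ETA.pdf(eta) * step
    return total

def irf(theta, eta, loading):
    """Hypothetical logistic IRF; loading = 0 means the item ignores the nuisance ability."""
    return 1.0 / (1.0 + math.exp(-(theta + loading * eta)))

def mean_abs_cond_cov(theta, loadings):
    """Average of |cov(U_i, U_j | Theta = theta)| over all item pairs i < j."""
    means = [expect(lambda e, a=a: irf(theta, e, a)) for a in loadings]
    total, pairs = 0.0, 0
    for i, j in combinations(range(len(loadings)), 2):
        pairs += 1
        if loadings[i] == 0.0 or loadings[j] == 0.0:
            continue  # an item free of eta has zero conditional covariance with every item
        joint = expect(lambda e: irf(theta, e, loadings[i]) * irf(theta, e, loadings[j]))
        total += abs(joint - means[i] * means[j])
    return total / pairs

# Five nuisance-dependent items inside tests of length 20 and 200.
short_test = mean_abs_cond_cov(0.0, [1.0] * 5 + [0.0] * 15)
long_test = mean_abs_cond_cov(0.0, [1.0] * 5 + [0.0] * 195)
```

In this hypothetical setup the average absolute conditional covariance shrinks as the test lengthens, because the number of contaminated pairs stays fixed while the number of pairs grows.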

We now must specify a class of scoring methods for the sequence of long tests $\{\underline{U}_N, N \geq 1\}$.
It is convenient to consider a large class of such scoring methods, but less extensive than the
non-decreasing item summaries (4-11). Recall from mathematical analysis that a collection of functions
$\{k_N(x)\}$ is equicontinuous if for every $\epsilon > 0$ there exists $\delta > 0$ such that

$|k_N(x) - k_N(y)| < \epsilon$

for all N and all x, y for which $|x - y| < \delta$. Note that the assumed continuity is uniform both in
the argument and in the choice of function.

Definition 5.1. $\{k_N(\sum_{i=1}^{N} a_{Ni} U_i)\}$ is called an equicontinuous balanced scoring method provided

(a) $k_N(x)$ is defined on [0,1], is non-decreasing, and satisfies

$-\infty < \inf_N k_N(0) \leq \sup_N k_N(0) < \inf_N k_N(1) \leq \sup_N k_N(1) < \infty$. (5-2)

(b) $\{k_N(x)\}$ is equicontinuous, and

(c) $\{a_{Ni} : 1 \leq i \leq N, N \geq 1\}$ satisfies $0 \leq a_{Ni} \leq C/N$ for some $C > 0$ and for all i, N and
$\sum_{i=1}^{N} a_{Ni} = 1$ for all N.




Remarks.

(i) (5-2) and (c) merely guarantee that the "empirical" scale established by $k_N(\sum_{i=1}^N a_{Ni}U_i)$ does
not shrink to 0 or stretch to $\infty$ as $N$ varies. For example, if $k_N(1) - k_N(0) \to 0$ as $N \to \infty$,
then $k_N(\sum_{i=1}^N a_{Ni}U_i)$ for large $N$ is uninteresting.

(ii) The bound $a_{Ni} \le C/N$ guarantees that no single item dominates the score; i.e., the scoring is
"balanced".

(iii) A remark on notation is appropriate. An arbitrary scoring method $h_N(\mathbf{U}_N)$ assigns a score
to each test response $\mathbf{U}_N$ and hence $h_N(\cdot)$ is a function with an $N$-dimensional domain
(such a score occurs in (4-11)). By contrast, an equicontinuous balanced scoring method
$k_N(\sum_{i=1}^N a_{Ni}U_i)$ assigns a score to each linear combination $\sum_{i=1}^N a_{Ni}U_i$ for each $N$ and hence
$k_N(\cdot)$ is a function with a unidimensional domain.
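
Condition (c) of Definition 5.1 and the form of a balanced score can be illustrated with a minimal Python sketch. The helper names `is_balanced` and `score`, the tolerance, and the example weights are hypothetical choices made here for illustration; they are not part of the paper.

```python
from typing import Callable, Sequence


def is_balanced(weights: Sequence[float], C: float, tol: float = 1e-9) -> bool:
    """Check condition (c) of Definition 5.1 for a single test length N.

    Requires 0 <= a_Ni <= C/N for every item and sum_i a_Ni = 1.
    """
    n = len(weights)
    within_bounds = all(0.0 <= a <= C / n + tol for a in weights)
    sums_to_one = abs(sum(weights) - 1.0) <= tol
    return within_bounds and sums_to_one


def score(responses: Sequence[int], weights: Sequence[float],
          k: Callable[[float], float] = lambda x: x) -> float:
    """Balanced score k_N(sum_i a_Ni * U_i); k defaults to the identity."""
    return k(sum(a * u for a, u in zip(weights, responses)))


N = 5
equal_weights = [1.0 / N] * N                # proportion-correct scoring
lopsided = [0.96, 0.01, 0.01, 0.01, 0.01]    # one item dominates the score

print(is_balanced(equal_weights, C=1.0))     # True
print(is_balanced(lopsided, C=2.0))          # False: 0.96 exceeds C/N = 0.4
print(score([1, 0, 1, 1, 0], equal_weights))
```

The lopsided weights sum to one but violate the $C/N$ bound for the chosen $C$, which is the sense in which remark (ii) calls a scoring method "balanced".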

A fundamental result of "long" test theory is that if a test $\{U_i, i \ge 1\}$ is essentially
unidimensional, consistent estimation of $\theta$ is possible in the sense that for any equicontinuous balanced
scoring method, given $\Theta_g = \theta$,

$$k_N\Big(\sum_{i=1}^N a_{Ni}U_{ig}\Big) - k_N\Big(\sum_{i=1}^N a_{Ni}T_i(\theta)\Big) \to 0 \tag{5-3}$$

in probability as $N \to \infty$, for $g = 1,2$ (established by a minor modification of the proof of
Theorem 3.2 in Stout (1989)). That is, $\theta$ is estimated with total accuracy in the limit, using the latent
scale

$$k_N\Big(\sum_{i=1}^N a_{Ni}T_i(\theta)\Big).$$

Here $T_i(\theta)$ denotes the marginal item response function defined by $T_i(\theta) = E[P_i(\boldsymbol{\Theta}) \mid \Theta = \theta]$.
Expectation is over both groups here; that is, $\Theta$ is the target ability of a randomly chosen examinee
from the pooled group resulting from combining the two groups. An important special case is that
when $d_E = 1$, given $\Theta_g = \theta$,

$$\sum_{i=1}^N U_{ig}/N - \sum_{i=1}^N T_i(\theta)/N \to 0$$

in probability as $N \to \infty$, for $g = 1,2$.
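
The special case above can be illustrated by simulation. The following is a minimal sketch under assumptions not taken from the paper: items are strictly unidimensional and locally independent (so that $T_i(\theta) = P_i(\theta)$), the IRFs are hypothetical two-parameter logistic curves, and the item parameters are drawn at random purely for illustration.

```python
import math
import random


def irf(theta: float, a: float, b: float) -> float:
    """Hypothetical monotone IRF (two-parameter logistic); an assumption."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_gap(theta: float, n_items: int, rng: random.Random) -> float:
    """Return |sum_i U_i / N - sum_i T_i(theta) / N| for one simulated examinee."""
    total_u = 0
    total_t = 0.0
    for _ in range(n_items):
        a = rng.uniform(0.5, 2.0)       # illustrative discrimination
        b = rng.uniform(-2.0, 2.0)      # illustrative difficulty
        p = irf(theta, a, b)
        total_u += 1 if rng.random() < p else 0   # Bernoulli item response
        total_t += p
    return abs(total_u - total_t) / n_items


rng = random.Random(0)
gaps = {n: simulate_gap(0.3, n, rng) for n in (10, 100, 10_000, 100_000)}
for n, gap in gaps.items():
    print(n, round(gap, 4))
```

The gap for a single examinee fluctuates for short tests and becomes small as $N$ grows, which is the weak-law behavior the displayed limit describes.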

Armed with the above concepts, a "long-test" definition of test bias is now given. The intuitive
idea is that if the test scoring method being used measures target ability equally well in both groups
as measured by the convergence in probability behavior as $N \to \infty$, then no test bias exists. Let
$\mathbf{U}_g$ denote the infinite item pool for Group $g$ and let

$$\mathbf{U}_{Ng} = (U_{1g}, \ldots, U_{Ng})$$

denote the finite observed segment of the item pool for $g$. To study long-test test bias, we make the
assumption that $\mathbf{U}_g$ has a two group representation of the form (4-4) with (3-4), Assumptions 3.1
and 3.2 holding within each group and with Assumption 4.1 holding. It then follows from the
ordinary weak law of large numbers in probability theory for any equicontinuous balanced test
scoring method that, given $\boldsymbol{\Theta}_1 = \boldsymbol{\theta}$ and $\boldsymbol{\Theta}_2 = \boldsymbol{\theta}$,

$$k_N\Big(\sum_{i=1}^N a_{Ni}U_{i1}\Big) - k_N\Big(\sum_{i=1}^N a_{Ni}P_i(\boldsymbol{\theta})\Big) \to 0 \quad\text{and}\quad k_N\Big(\sum_{i=1}^N a_{Ni}U_{i2}\Big) - k_N\Big(\sum_{i=1}^N a_{Ni}P_i(\boldsymbol{\theta})\Big) \to 0 \tag{5-4}$$

in probability as $N \to \infty$. Here $\boldsymbol{\theta} = (\theta, \eta)$ where $\theta$ is the target ability and $\eta$ is the nuisance
determinant. Of course, in order to be able to assume local independence for the representation
(4-4) and have good model fit the dimension $d$ of $\eta$ may need to be quite large. It is easy to show
(5-4) also holds for a $d_E$ essential dimensional representation of the form (4-4), with $d_E$ possibly
much smaller than $d$.

Because $k_N(\sum_{i=1}^N a_{Ni}U_{i1})$ and $k_N(\sum_{i=1}^N a_{Ni}U_{i2})$ have the same limit behavior in probability
(hence $(\theta,\eta)$ is measured equally well in both groups), (5-4) seems to suggest that no test bias in
a long-test sense is possible. However, (5-4) is not the same as group-equivalent measurement of
target ability $\theta$ alone. As in the finite test length case of Section 4, the source of bias is that the
conditional distributions of $(\eta_1 \mid \Theta_1 = \theta)$ and $(\eta_2 \mid \Theta_2 = \theta)$ differ, thereby leading to superior limiting
test scores for one group versus another given $\Theta_1 = \theta$, $\Theta_2 = \theta$. An example should clarify this
claim.

Example 5.1. Consider examinee subpopulations from the two groups defined by $\Theta_1 = \theta$ and
$\Theta_2 = \theta$, respectively, i.e., both subpopulations have the same target ability. Suppose that there is
a single nuisance determinant and that

P[rj,=l\Q,=e] = \ P[rj, = 1\Q, = e] = 

(5-5) 

P[Vx =O|0i =^1 = 2 = o|02 =. ^) = i 
Clearly this is a case of potential for bias against Group 1 at $\theta$. Suppose $k_N(x) = x$ for all $N$ and
$a_{Ni} = 1/N$ for all $i$ and $N$:

$$k_N\Big(\sum_{i=1}^N a_{Ni}U_{ig}\Big) = \sum_{i=1}^N U_{ig}/N.$$

Suppose local independence with respect to $(\theta,\eta)$ with



for all $i$. Then, (5-4) specializes to



' 3' * 3 

given $\Theta_1 = \theta, \eta_1 = 1$ and $\Theta_2 = \theta, \eta_2 = 1$, respectively, in probability as $N \to \infty$. Also

3' ^ 3 

given $\Theta_1 = \theta, \eta_1 = 0$ and $\Theta_2 = \theta, \eta_2 = 0$, respectively, in probability as $N \to \infty$. But, conditioning
on $\Theta_1 = \theta$ and $\Theta_2 = \theta$ it follows using (5-5) that



^''y § with probability i and 



-* \ with probability^ 



(5-6) 



as $N \to \infty$, as contrasted with



y I with probability! and 



(5-7) 



^" ^ with probability \ 
as $N \to \infty$. Clearly Group 2 is favored among examinees of target ability $\theta$. It may be interesting
to note that



N 



\Qi = o 



4 3^4 312 



for all $N$, while



i=d=iili2|0 ^ ^ =,1.14.2. 2 = X 

\KJ2 ~ u - ^ g-r^ 312- 



(5-8) 



(5-9) 



Thus, in a trivial manner not dependent on $N$,



N 



101 = ^ 



- E 



%^|02 = ^ 



= -i<0. 



(5 - 10) 
□ 
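
The mechanism of Example 5.1 can be sketched with exact rational arithmetic. The numerical values below are illustrative assumptions chosen here and are not claimed to be the values used in the example: the nuisance determinant takes the value 1 with probability 1/4 in Group 1 and 1/2 in Group 2, and every item has IRF value 2/3 when $\eta = 1$ and 1/3 when $\eta = 0$. Under proportion-correct scoring the conditional expected score is a mixture of these IRF values, so its group difference does not depend on $N$.

```python
from fractions import Fraction as F

# Illustrative assumptions (not taken from the source):
p_eta_is_1 = {1: F(1, 4), 2: F(1, 2)}   # P[eta_g = 1 | Theta_g = theta], g = 1, 2
irf_value = {1: F(2, 3), 0: F(1, 3)}    # common P_i(theta, eta) for eta = 1, 0


def expected_score(group: int) -> F:
    """E[sum_i U_ig / N | Theta_g = theta] when all items share one IRF."""
    p1 = p_eta_is_1[group]
    return p1 * irf_value[1] + (1 - p1) * irf_value[0]


e1, e2 = expected_score(1), expected_score(2)
print(e1, e2, e1 - e2)   # 5/12 1/2 -1/12
```

With these assumed values the difference of conditional expected scores is negative for every $N$, which is the kind of statement that (5-10) expresses.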



We will use the idea embodied in (5-10) to define large sample test bias. 
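Definition 5.2. Let $\theta$ denote target ability.

(i) There is no long-test test bias at $\theta$ with respect to an equicontinuous balanced scoring method
$\{k_N(\sum_{i=1}^N a_{Ni}U_i)\}$ provided

$$E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] \to 0 \tag{5-11}$$

as $N \to \infty$.

(ii) If for every $\theta$, there is no long-test test bias at $\theta$, then there is no long-test test bias.

(iii) If at $\theta$

$$E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] < C < 0 \tag{5-12}$$

for all sufficiently large $N$ and some $C$, then long-test test bias exists at $\theta$ against Group 1.

□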

We first show that if there is no long-test test bias in the empirical sense that among examinees
with the same target ability $\theta$ neither group is favored in their stochastic test score behavior as
$N \to \infty$, then no long-test test bias exists in the sense of Definition 5.2.

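Theorem 5.1. Suppose, given $\Theta_1 = \theta$ and $\Theta_2 = \theta$, that for an equicontinuous balanced scoring
method,

$$k_N\Big(\sum_{i=1}^N a_{Ni}U_{i1}\Big) - c_{N1}(\theta) \to 0 \quad\text{and}\quad k_N\Big(\sum_{i=1}^N a_{Ni}U_{i2}\Big) - c_{N2}(\theta) \to 0 \tag{5-13}$$

in probability for some $c_{N1}(\theta)$, $c_{N2}(\theta)$, as $N \to \infty$. Then (5-11) holds; that is, there is no long-test
test bias at $\theta$ for the given scoring method.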



Remark. Note that it is not required that the centering functions $c_{Ng}(\theta)$ be the same for
$g = 1,2$. What is required is the existence of a centering function dependent on $\theta$ alone and not
on $\eta$ for each $g$, as contrasted with (5-4). Of course, the case where the centering functions are the
same is of special interest and is the main motivation for the theorem, as the remark immediately
prior to the statement of the theorem indicates.
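Proof. By (5-2), $|k_N(x)| \le C$ for some $C > 0$. Thus $|c_{Ng}(\theta)| \le C + D$ for some $D > 0$. By the
Lebesgue dominated convergence theorem (see p. 11, Serfling, 1980), using (5-13),

$$E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{ig}\Big) - c_{Ng}(\theta) \,\Big|\, \Theta_g = \theta\Big] \to 0 \tag{5-14}$$

as $N \to \infty$, for $g = 1,2$. Now, trivially, the conclusion (5-13) holds given $\Theta_1 = \theta$, $\eta_1 = \eta$ and
$\Theta_2 = \theta$, $\eta_2 = \eta$. Thus, subtracting the two results in (5-13),

$$c_{N1}(\theta) - c_{N2}(\theta) \to 0$$

as $N \to \infty$. Let $c_N(\theta) = c_{N1}(\theta)$. It then follows from (5-14) that

$$E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{ig}\Big) - c_N(\theta) \,\Big|\, \Theta_g = \theta\Big] \to 0$$

as $N \to \infty$ for $g = 1,2$. Subtracting these two limits yields

$$E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^N a_{Ni}U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] \to 0$$

as $N \to \infty$, i.e., no long-test test bias exists at $\theta$. □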
Remark. We claim that (5-13) (and hence the similar condition (5-3)) is inappropriately strong
to use as a definition of lack of long-test bias. To see this, modify Example 5.1 by assuming

?[t;, = O10, =^] = |,?[77, = 110, = ^]=i, 

for $g = 1,2$. Hence no potential for bias exists. However note that, given $\Theta_1 = \theta$ and $\Theta_2 = \theta$,



N 



- 5 0 with probability \ 



and 



N 



UNg - 3 0 with probability ^ 

1=1 

for both $g = 1$ and $g = 2$. Thus (5-13) is precluded and thus long-test bias would be said to exist
(even though no potential for bias exists) if (5-13) were made the basis for deciding on the existence
of long-test test bias. Note that the above convergence in probability behavior is identical for both
groups. Intuitively, in this example the estimation of $\theta$ by $\sum_{i=1}^N U_{ig}/N$ as $N \to \infty$ is equally bad
for both groups in the sense that convergence in probability at $\theta$ fails to occur in exactly the same
manner in both groups. Thus one would not wish to claim that test bias is occurring.

The following theorem states that essential unidimensionality is a sufficient condition for ensuring
that no long-test test bias exists.

Theorem 5.2. Suppose $d_E = 1$ for target ability $\theta$ in the combined population consisting of
Group 1 and Group 2 examinees. Then, with respect to all equicontinuous balanced scoring methods,
no long-test test bias exists.

Proof. Let $\{k_N(\sum_{i=1}^N a_{Ni}U_i)\}$ be an arbitrary equicontinuous balanced scoring method. We need
to prove (5-11) for every $\theta$. Fix $\theta$. By work of Stout (1989), $d_E = 1$ implies (5-3) for $g = 1,2$; i.e.,
(5-13) holds with $c_{Ng}(\theta) = k_N(\sum_{i=1}^N a_{Ni}T_i(\theta))$. Thus, by Theorem 5.1, the desired result holds. □

By contrast, if the potential for bias exists at $\theta$, then it follows that there exist balanced scoring
methods for which long-test test bias at $\theta$ does exist.

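Theorem 5.3. Assume that IRFs are differentiable in $\eta$. Let $\theta$ denote target ability, $\eta$ denote the
nuisance determinant and assume potential for bias against Group 1 at $\theta$. Assume there exists a
balanced scoring method $\{a_{Ni}\}$ (i.e., $k_N(x) = x$ in Definition 5.1) such that at $\theta$,

$$\frac{\partial}{\partial \eta} \sum_{i=1}^N a_{Ni}P_i(\theta,\eta) > \epsilon > 0 \tag{5-15}$$

for all $\eta$ and all $N$. Then long-test test bias exists at $\theta$ against Group 1.

Proof. For $\boldsymbol{\theta} = (\theta,\eta)$, (5-4) holds given $\Theta_1 = \theta, \eta_1 = \eta$; $\Theta_2 = \theta, \eta_2 = \eta$. Now, letting $F_g(\eta \mid \theta)$
denote the cdf of $\eta_g \mid \Theta_g = \theta$ and using (5-15) and integration by parts,

$$E\Big[\sum_{i=1}^N a_{Ni}U_{i1} \,\Big|\, \Theta_1 = \theta\Big] - E\Big[\sum_{i=1}^N a_{Ni}U_{i2} \,\Big|\, \Theta_2 = \theta\Big] = \int_{-\infty}^{\infty} \Big[\sum_{i=1}^N a_{Ni}P_i(\theta,\eta)\Big] \, d[F_1(\eta \mid \theta) - F_2(\eta \mid \theta)]$$

$$\le -\epsilon \int_{-\infty}^{\infty} [F_1(\eta \mid \theta) - F_2(\eta \mid \theta)] \, d\eta$$

$$\le -c(\theta),$$

where $c(\theta) > 0$ by the assumption of potential for bias against Group 1. Since this holds for all $N$,
the result is proved by Definition 5.2. □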
How is the finite test length definition of test bias (Definition 4.6) related to the long-test test
bias definition (Definition 5.2)? The answer is that lack of finite length test bias for all finite length
tests $\mathbf{U}_N$ from the item pool $\{U_i, i \ge 1\}$ implies lack of long-test test bias for all equicontinuous
balanced test scores.

Theorem 5.4. Assume an IRT representation for $\{U_i, i \ge 1\}$ of the form (4-4) for $\boldsymbol{\theta} = (\theta,\eta)$. Let
$\{k_N(\sum_{i=1}^N a_{Ni}U_i)\}$ be an equicontinuous balanced scoring method. Assume no finite length test
bias exists; that is, (4-13) holds for all $N$. Assume regularity Assumption 4.2. Then there is no
long-test test bias; that is, (5-11) holds.

Proof. Trivial from examination of (4-13) and (5-11). □

Remark. Of course, the absence of long-test test bias is a less restrictive condition than the absence
of finite length test bias. Nonetheless it seems an appropriate way to describe biasedness of a test
when the test is long.

From the long test perspective, the need to produce a long-test definition of a valid subtest needs
to be addressed. Previously in the short test case, our definition of a valid subtest $S$ with response
$\mathbf{U}'$ was stated to be equivalent to (4-22) holding for all $(\theta,\eta)$. Just as the short-test version of no
test bias ((4-13)) is modified for the long-test version of no test bias ((5-11)), a similar modification
of (4-22) yields an appropriate definition of a valid subtest. We consider only equicontinuous
balanced scoring methods for subtests $\mathbf{U}'_N$ of $\mathbf{U}_N$. That is, we consider scoring $k'_N(\sum' a_{Ni}U_i)$ where
Definition 5.1 holds for each $k'_N(\sum' a_{Ni}U_i)$, and where $\sum'$ denotes summation over the indices of the
components of $\mathbf{U}'_N$.

Definition 5.3. Let the item pool $\{U_i, i \ge 1\}$ have IRT representation (3-7) with the usual
accompanying assumptions. Let $\mathbf{U}'_N \subset \mathbf{U}_N$ denote a subtest of $\mathbf{U}_N$ for each $N$. Denoting the
cardinality of a set $A$ as $\mathrm{card}(A)$, assume

$$\mathbf{U}'_N \subset \mathbf{U}'_{N+1}, \qquad \frac{\mathrm{card}(\mathbf{U}'_N)}{N} > C > 0 \tag{5-16}$$

for some $C$ and for all $N \ge N_0$ for some fixed $N_0$ ($N_0$ will be small in all applications). Then
$\{\mathbf{U}'_N, N \ge 1\}$ is said to be a collection of valid subtests with respect to a specified equicontinuous
balanced scoring method $\{k'_N(\sum' a_{Ni}U_i)\}$ provided there exists a function $c_N(\theta)$ such that for all
$(\theta,\eta)$,

$$E\big[k'_N\big(\textstyle\sum' a_{Ni}U_i\big) \mid (\Theta,\eta) = (\theta,\eta)\big] - c_N(\theta) \to 0 \tag{5-17}$$

as $N \to \infty$.

Remark. Recall that short-test subtest validity, i.e., (4-22) holding for all $(\theta,\eta)$, for scoring method
$k'_N(\sum' a_{Ni}U_i)$ say, simply means that for each $\theta$

$$m(\theta,\eta) \equiv E\big[k'_N\big(\textstyle\sum' a_{Ni}U_i\big) \mid (\Theta,\eta) = (\theta,\eta)\big]$$

depends only on $\theta$ and not on $\eta$. By contrast the long-test subtest validity just defined by (5-17)
weakens this to asserting that $m(\theta,\eta)$ for all $\theta$ is asymptotically not dependent on $\eta$ as $N \to \infty$.
That is, intuitively, for large fixed $N$, $m(\theta,\eta)$ for all $\theta$ is approximately constant as $\eta$ varies.

As with long-test test bias, the theory of essential unidimensionality is useful in studying
long-test subtest validity:

Theorem 5.5. Assume $d_E = 1$ with latent ability $\theta$ being target ability for subtests $\{\mathbf{U}'_N, N \ge 1\}$
satisfying (5-16). Then (5-17) holds for all equicontinuous balanced scoring methods; i.e., subtest
validity holds for all equicontinuous balanced scoring methods.

Proof. It follows from a minor modification of the proof of Theorem 3.2 in Stout (1989) that for
all $(\theta,\eta)$

$$k'_N\big(\textstyle\sum' a_{Ni}U_i\big) - c_N(\theta) \to 0 \tag{5-18}$$

in probability as $N \to \infty$. But $|k'_N(\sum' a_{Ni}U_i)| \le C$ for some constant $C < \infty$. It is a standard result
from the theory of convergence in probability that convergence in probability and the boundedness
just stated together imply convergence in expectation. That is, for all $(\theta,\eta)$,

$$E\big[k'_N\big(\textstyle\sum' a_{Ni}U_i\big) \mid (\Theta,\eta) = (\theta,\eta)\big] - c_N(\theta) \to 0$$

as $N \to \infty$; i.e., (5-17) holds. □

Stout (1987) has developed a statistical test for essential unidimensionality. Clearly this could
be applied to a subtest to assess whether it can be used as a valid subtest in the case of a "long"
test.

6 Test Bias as a Function of Target Ability 

Sections 4 and 5 focus on test bias for fixed values of target ability $\theta$. In these sections it was
argued that test bias (item bias also) is a phenomenon that expresses itself at each $\theta$. In particular,
it is the comparison of the distributions of $(\eta_1 \mid \Theta_1 = \theta)$ and $(\eta_2 \mid \Theta_2 = \theta)$ that dictates whether test
bias is possible at $\theta$ and if such bias is possible, in which direction (biased in favor of or biased
against Group 1) it occurs. Mathematically, without further assumptions, one cannot infer what
the character of the bias at $\theta' \ne \theta$ is from the character of the bias at $\theta$. This section develops
the concept of considering test bias aggregated over target ability. We return to the convention of
suppressing $N$ in the notation when appropriate; e.g., $\mathbf{U} \equiv \mathbf{U}_N$.

Definition 6.1. Let $h(\mathbf{U})$ be a test scoring method and $\mathbf{U}$ be a test response as in (3-1). The
expected test bias at $\theta$ against Group 1 using test scoring method $h(\mathbf{U})$ is given by

$$B(\theta) = E[h(\mathbf{U}_2) \mid \Theta_2 = \theta] - E[h(\mathbf{U}_1) \mid \Theta_1 = \theta]. \tag{6-1}$$

□



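Remarks.

(i) Note that $B(\theta) > 0$ indicates test bias against Group 1 at $\theta$.

(ii) Several special cases are of interest. If $h(\mathbf{u}) = \sum_{i=1}^N u_i/N$ then $B(\theta)$ is the difference of
(marginal) test characteristic curves (average of marginal IRFs):

$$B(\theta) = \frac{\sum_{i=1}^N T_{i2}(\theta)}{N} - \frac{\sum_{i=1}^N T_{i1}(\theta)}{N}. \tag{6-2}$$

If $h(\mathbf{u}) = u_i$, then

$$B(\theta) = T_{i2}(\theta) - T_{i1}(\theta),$$

the amount of item $i$ bias against Group 1 at $\theta$.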

Probably the most common pattern in the potential for bias as a function of $\theta$ is unidirectional
potential for bias:

Definition 6.2. If potential for bias exists against the same group at every $\theta$ then unidirectional
potential for bias is said to exist against the group. □

Another less common, but still important pattern in the potential for bias as a function of $\theta$ is
that the "direction" of the potential for bias changes from one end of the $\theta$-continuum to the other:

Definition 6.3. Suppose for some fixed $\theta_0$ that the potential for bias against one group exists for
all $\theta < \theta_0$ and the potential for bias exists against the other group for all $\theta > \theta_0$. Then bidirectional
potential for bias is said to exist. □

The verbal analogies example of Section 2 is an obvious practical example of unidirectional
potential for bias, for it seems likely that the potential for test bias against German immigrants
will hold regardless of the level of verbal analogies ability being conditioned on.
As an example of bidirectional potential for bias, suppose $\Theta_1$ and $\Theta_2$ are both uniformly
distributed on the interval $[-1,1]$. Suppose that in Group 1, $\Theta_1$ and $\eta_1$ are statistically independent
with $\eta_1$ uniformly distributed on $[-1,1]$. Suppose in Group 2 that $(\eta_2 \mid \Theta_2 = \theta)$ has a uniform
distribution on the interval with end points 0 and $2\theta$. That is, perhaps because of cultural differences,
in Group 2 it follows that $\Theta$ and $\eta$ are highly positively correlated while $\Theta$ and $\eta$ are uncorrelated
in Group 1. Elementary computation shows that if $-1 \le \theta < 0$, (4-10) holds, yet if $0 < \theta \le 1$, (4-9)
holds. That is, potential for bias against Group 2 holds for $\theta < 0$ and potential for bias against
Group 1 holds if $\theta > 0$; i.e., bidirectional potential for bias holds.

Test bias (and item bias) can be unidirectional or bidirectional.

Definition 6.4. If test bias (either in the ordering sense of Definition 4.6 or in the long-test sense
of Definition 5.2) exists against the same group at every $\theta$, then unidirectional test bias against
that group is said to hold.

Definition 6.5. If for some $\theta_0$ test bias in the sense of Definition 4.6 holds against one group for
all $\theta < \theta_0$ and against the other group for all $\theta > \theta_0$ then bidirectional test bias is said to occur. □

A long-test version of Definition 6.5 is easy to give but is omitted for simplicity. The following
results relate unidirectional potential for bias to unidirectional test bias.

Theorem 6.1. Suppose test bias exists against Group 1 at some $\theta$ in the sense of Definition 4.6,
and suppose unidirectional potential for bias. Assume a test scoring method of the form (4-11).
Suppose for every $\theta'$ that there is some $i$ (possibly dependent on $\theta'$) for which $h(\mathbf{u})$ is strictly
increasing as $u_i = 0$ increases to $u_i = 1$ and for which $P_i(\theta',\eta)$ is strictly increasing in $\eta$. Then
unidirectional test bias against Group 1 holds.

Proof. By Theorem 4.3, the potential for bias against Group 1 at $\theta$ holds. By assumption of
unidirectional potential for bias, the potential for bias against Group 1 thus holds for all $\theta'$. Apply
Theorem 4.2 together with the remark (i) following it. □

Theorem 6.2. Assume IRFs are differentiable in $\eta$. Suppose long-test test bias exists against
Group 1 at some $\theta$ in the sense of Definition 5.2 for a balanced scoring method $\{a_{Ni}\}$ and suppose
unidirectional potential for test bias. Assume for every $\theta$, as in (5-15),

$$\frac{\partial}{\partial \eta} \sum_{i=1}^N a_{Ni}P_i(\theta,\eta) > \epsilon > 0$$

for all $\eta$ (without loss of generality assumed unidimensional here). Then unidirectional (long-test)
test bias against Group 1 holds in the sense of Definition 6.4.

Proof. Same as that of Theorem 6.1 except Theorem 5.3 is used in place of Theorem 4.2. □
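In order to study bidirectional test bias, attention is restricted to balanced scoring methods.
For an arbitrary balanced scoring method $\sum_{i=1}^N a_{Ni}U_i$, letting

$$F_g(\eta \mid \theta) = P[\eta_g \le \eta \mid \Theta_g = \theta]$$

and assuming differentiability of IRFs and a unidimensional nuisance determinant, the following
formula for $B(\theta)$ of (6-1) obtained by integration by parts is useful:

$$B(\theta) = \int_{-\infty}^{\infty} \Big[\frac{\partial}{\partial \eta} \sum_{i=1}^N a_{Ni}P_i(\theta,\eta)\Big] [F_1(\eta \mid \theta) - F_2(\eta \mid \theta)] \, d\eta. \tag{6-3}$$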
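
The integration-by-parts identity behind (6-3) can be checked numerically. The sketch below rests on assumptions introduced only for illustration: three hypothetical logistic IRFs with made-up parameters, equal weights $a_{Ni} = 1/N$, and normal conditional distributions for the nuisance determinant with means 0 (Group 1) and 0.5 (Group 2). It compares $B(\theta)$ computed directly from (6-1) with the right-hand side of (6-3).

```python
import math

# Hypothetical item parameters (a, b, c): P_i = logistic(a*(theta - b) + c*eta).
ITEMS = [(1.0, -0.5, 0.8), (1.5, 0.0, 0.4), (0.7, 0.5, 1.2)]
MU = {1: 0.0, 2: 0.5}   # assumed means of eta_g | Theta_g = theta (unit variance)


def norm_pdf(x: float, mu: float) -> float:
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def norm_cdf(x: float, mu: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))


def p_item(theta: float, eta: float, a: float, b: float, c: float) -> float:
    return 1.0 / (1.0 + math.exp(-(a * (theta - b) + c * eta)))


def g(theta: float, eta: float) -> float:
    """sum_i a_Ni P_i(theta, eta) with equal weights a_Ni = 1/N."""
    return sum(p_item(theta, eta, *it) for it in ITEMS) / len(ITEMS)


def dg(theta: float, eta: float) -> float:
    """Partial derivative of g with respect to eta."""
    total = 0.0
    for a, b, c in ITEMS:
        p = p_item(theta, eta, a, b, c)
        total += c * p * (1.0 - p)
    return total / len(ITEMS)


def integrate(f, lo: float = -12.0, hi: float = 12.0, steps: int = 24000) -> float:
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    s = 0.5 * (f(lo) + f(hi))
    for j in range(1, steps):
        s += f(lo + j * h)
    return s * h


theta = 0.0
# Direct form of (6-1): E[score | group 2] - E[score | group 1].
direct = integrate(lambda e: g(theta, e) * (norm_pdf(e, MU[2]) - norm_pdf(e, MU[1])))
# Integration-by-parts form (6-3).
by_parts = integrate(lambda e: dg(theta, e) * (norm_cdf(e, MU[1]) - norm_cdf(e, MU[2])))
print(direct, by_parts)
```

Because the assumed Group 2 distribution is stochastically larger and every assumed IRF increases in $\eta$, both computations give the same positive value, i.e., bias against Group 1 at this $\theta$.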

Theorem 6.3. Assume a balanced scoring method with differentiable IRFs. Assume a unidimensional
nuisance trait $\eta$. Assume for each $\theta$, there exists some $i$ (possibly varying with $\theta$) for
which

$$a_{Ni} > 0, \qquad \frac{\partial}{\partial \eta} P_i(\theta,\eta) > 0 \ \text{ for all } \eta. \tag{6-4}$$

Then bidirectional potential for test bias holds if and only if bidirectional test bias holds.

Proof. By Assumption 4.2, for fixed $\theta$ either

$$F_1(\eta \mid \theta) - F_2(\eta \mid \theta) \ge 0 \ \text{ for all } \eta \tag{6-5}$$

or

$$F_1(\eta \mid \theta) - F_2(\eta \mid \theta) \le 0 \ \text{ for all } \eta. \tag{6-6}$$

Thus, using (6-3), (6-4) and the strict monotonicity of every $P_i(\theta,\eta)$ in $\eta$, $B(\theta) > 0$ or $B(\theta) < 0$
accordingly as (6-5) or (6-6) holds. Potential for bias at $\theta$ means that either (6-5) or (6-6) holds at
$\theta$. The desired result follows. □
Assume number correct scoring, which implies (6-2) and hence that test bias is controlled by
the (marginal) item response functions with respect to target ability. Graphically, bidirectional
test bias under this scoring method is shown in Figure 3. Note the effect is that the test displays
higher discrimination for Group 2 than for Group 1. That is, bidirectional test bias is expressed as
differing test discriminations for the two groups. By contrast, under (6-2), unidirectional test bias
is shown in Figure 4. Unidirectional test bias is not linked to differing test discriminations across
group. Indeed the two test characteristic curves shown in Figure 4 can even be translates of one
another; e.g., for some $c > 0$, for given $T_{i2}(\theta) = T_i(\theta)$,

$$\sum_{i=1}^N T_{i1}(\theta)/N = \sum_{i=1}^N T_i(\theta + c)/N$$

for all $\theta$. That is, items could be uniformly more difficult for Group 2 examinees at every $\theta$.

There is a debate about whether from the cognitive perspective, differing discriminations across 
group is more the essence of bias than differing difficulties across group. Also, some practitioners 
claim that bidirectional test bias can be important in practice while others discount its importance. 
It is hoped that Section 6 helps illuminate these issues. 

7 Discussion and Summary of Results 

The central position of this paper is that bias should be conceptualized, studied, and measured at
the test level rather than at the item level. A multidimensional but non-parametric IRT model
of test bias is presented and a number of important properties derived. Our theory of test bias
includes the often used unidimensional IRT bias approach as a special case.

The model hypothesizes a target ability intended to be measured by the test as well as other
dimensions called nuisance determinants, not intended to be measured. Informally, test bias occurs
when the test under consideration is measuring nuisance determinants in addition to the target
ability, and moreover the two groups do not possess equal amounts of the nuisance determinants.
Our view, an outgrowth of the classical predictive validity viewpoint of bias, is that bias is really
something expressed at the test level via the particular test score in use and that bias rests in the
across-group differences in the relationship between test scores and criterion. For us the "criterion"
is internal to the test and is expressed by a "valid" subtest known to consist of items measuring only
target ability. In order to statistically detect test bias, a valid subtest must exist and be identified.

In Section 3, the multidimensional non-parametric IRT model is presented. The notion of the 
marginal IRF with respect to target ability is introduced. 
In Section 4, test bias is carefully defined using the IRT model introduced in Section 3. Test
bias originates with the potential for test bias at a particular value $\theta$ of target ability existing
against a group in the sense of Definition 4.3. This potential for bias against Group 1 gets expressed
at the item level if any of the marginal group IRFs satisfy $T_{i1}(\theta) < T_{i2}(\theta)$. The potential for test
bias and a strictly increasing IRF in $\eta$ imply expressed item bias (Theorem 4.1).

The main focus of this paper is on biased items acting in concert. Three components combine to
produce test bias: (a) potential for bias, (b) dependence of the IRFs on $\eta$, and (c) the test scoring
method, which transmits simultaneous expressed item bias into test bias. Test bias is formally
defined in (4-12). It is shown that test bias at $\theta$ implies the potential for bias at $\theta$ (Theorem 4.3).
The central result of Section 4 (Theorem 4.2) shows that potential for bias at $\theta$ translates into test
bias at $\theta$ provided the scoring method depends on at least one item that has a strictly increasing
IRF in $\eta$ at $\theta$.

The important topic of item bias cancellation is taken up in Section 4.4. Example 4.1 illustrates
how cancellation can actually decrease the amount of item bias that gets expressed at the test level.
That is, the potential for bias need not be strongly transmitted to the test level because in fact
considerable cancellation can occur as the result of multidimensional nuisance determinants. By
contrast, small and perhaps undetectable amounts of bias at the item level can be translated into
a substantial amount of bias expressed at the test level when no cancellation occurs. Section 4.5
formalizes the notion of a valid subtest, which must exist for test bias to be detected. Shealy and
Stout (1990) present a statistical test of test bias, making the question of whether test bias does
exist for a particular data set an answerable one.

Section 5 presents a long-test viewpoint of test bias, making heavy use of Stout's theory of essential
unidimensionality. The absence of long-test test bias is defined. It is shown that if an equicontinuous
balanced test score (a large class of reasonable test scores are of this kind) displays appropriate
convergence in probability behavior separately in each examinee group, then there can be no long-test
test bias. Essential unidimensionality ($d_E = 1$) of a test with target ability as the latent trait
is shown to exclude long-test test bias. Because one can statistically test for essential
unidimensionality (Stout, 1987), this is a potentially very useful result. Theorem 5.3 is important as the
long-test analogue to Theorem 4.2. It links potential for bias and scoring method to the existence
of long-test test bias.
A long-test viewpoint of subtest validity is also presented in Section 5. Informally stated, the
main result is that $d_E = 1$ for a subtest with the latent trait being target ability implies subtest
validity for all equicontinuous balanced scoring methods.

Section 6 considers test bias aggregated over target ability. The important concepts of
unidirectional and bidirectional test bias are introduced. The relationship between differing discriminations
across group and bidirectional test bias is explicated.

It is hoped that the above theory of test validity proves useful to theoreticians and practitioners 
alike. 

Acknowledgement. The authors found discussions with Terry Ackerman, Paul Holland, Lloyd 